CUB v1.9.8 Release Notes

Release Date: 2020-05-19
  • Summary

    CUB 1.9.8 is the first release of CUB to be officially supported and included in the CUDA Toolkit.
    When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query APIs, which improves performance of these queries by 20x to 50x when they are called concurrently by multiple host threads.

    Enhancements

    • (C++11 or later) Cache calls to cudaFuncGetAttributes and cudaDeviceGetAttribute within cub::PtxVersion and cub::SmVersion. These CUDA APIs acquire locks on the CUDA driver/runtime mutex and perform poorly under contention; with the caching, they are 20x to 50x faster when called concurrently. Thanks to Bilge Acun for bringing this issue to our attention.
    • DispatchReduce now takes an OutputT template parameter so that users can specify the intermediate type explicitly.
    • Radix sort tuning policies updated to fix performance issues for element types smaller than 4 bytes.
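
The caching enhancement above can be illustrated with a minimal host-side sketch. This is not CUB's actual implementation: SlowAttributeQuery is a hypothetical stand-in for a lock-protected query such as cudaDeviceGetAttribute, and kMaxDevices, CacheEntry, and CachedAttributeQuery are names invented for this example. It only shows the general idea of paying for the expensive query once per device and serving later calls from a lock-free read.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for an expensive, lock-protected attribute query.
// It is NOT a CUDA API; it only counts how often the slow path runs.
static std::atomic<int> g_slow_query_calls{0};

int SlowAttributeQuery(int device)
{
    ++g_slow_query_calls;
    return 70 + device; // pretend attribute value, always non-negative
}

// Assumed upper bound on the number of devices, chosen for this sketch.
constexpr int kMaxDevices = 8;

struct CacheEntry
{
    std::atomic<int> value{-1}; // -1 means "not queried yet"
};

int CachedAttributeQuery(int device)
{
    assert(device >= 0 && device < kMaxDevices);

    // Function-local statics are initialized thread-safely in C++11.
    static CacheEntry cache[kMaxDevices];

    // Hot path: a single atomic load, no mutex.
    int v = cache[device].value.load(std::memory_order_acquire);
    if (v < 0)
    {
        // Cold path: threads that race here may each run the slow query,
        // but they all store the same value, so the race is benign.
        v = SlowAttributeQuery(device);
        cache[device].value.store(v, std::memory_order_release);
    }
    return v;
}
```

In this sketch, calling CachedAttributeQuery(0) repeatedly from one thread runs the slow query only on the first call. The trade-off is that a cached value is never refreshed, which is acceptable only for attributes that do not change during the life of the process.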

    Bug Fixes

    • Change initialization style from copy initialization to direct initialization (which is more permissive) in AgentReduce to allow a wider range of types to be used with it.
    • Fix bad signed/unsigned comparisons in WarpReduce.
    • Fix computation of valid lanes in the warp-level reduction primitive to correctly handle the case where there are 0 input items per warp.
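
To show why direct initialization is more permissive than copy initialization, as noted in the first bug fix above, here is a small standalone sketch. ExplicitSum and SeedAccumulator are hypothetical names made up for this example; they are not part of CUB and do not reproduce the AgentReduce code.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical accumulator type whose converting constructor is explicit.
struct ExplicitSum
{
    int value;
    explicit ExplicitSum(int v) : value(v) {}
};

// Direct initialization (ExplicitSum s(5);) may use an explicit constructor.
static_assert(std::is_constructible<ExplicitSum, int>::value,
              "direct initialization from int is allowed");

// Copy initialization (ExplicitSum s = 5;) needs an implicit conversion,
// which an explicit constructor does not provide.
static_assert(!std::is_convertible<int, ExplicitSum>::value,
              "copy initialization from int is rejected");

// Sketch of a reduction step that seeds its accumulator from an input item.
template <typename AccumT, typename InputT>
AccumT SeedAccumulator(const InputT& first_item)
{
    // Writing "AccumT acc = first_item;" here would fail to compile for
    // ExplicitSum; the direct form below accepts it.
    AccumT acc(first_item);
    return acc;
}
```

With this sketch, SeedAccumulator<ExplicitSum>(5) compiles and yields an accumulator holding 5, while the copy-initialization form would reject the same type at compile time.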