Thrust v1.9.4 Release Notes
Release Date: 2019-03-01
🔀 Thrust 1.9.4 adds asynchronous interfaces for parallel algorithms, a new allocator system including caching allocators and unified memory support, as well as a variety of other enhancements, mostly related to C++11/C++14/C++17/C++20 support. The new asynchronous algorithms in the thrust::async namespace return thrust::future objects, which can be waited upon to synchronize with the completion of the parallel operation.
💥 Breaking API Changes
🔀 Synchronous Thrust algorithms now block until all of their operations have completed. Use the new asynchronous Thrust algorithms for non-blocking behavior.
🆕 New Features
thrust::future<T>, uniquely-owned asynchronous handles consisting of a state (ready or not ready), content (some value; for thrust::future only), and an optional set of objects that should be destroyed only when the future's value is ready and has been consumed.
- The design is loosely based on C++11's
- They can be .wait'd on, and the value of a future can be waited on and retrieved with
thrust::futures can be combined with
thrust::futures can be converted to
- Currently, these primitives are only implemented for the CUDA backend and are C++11 only.
🆕 New asynchronous algorithms that return thrust::futures, implemented as C++20 range style customization points:
thrust::async::reduce_into, which takes a target location to store the reduction result into.
thrust::async::copy, including a two-policy overload that allows explicit cross-system copies and to which execution policy properties can be attached.
- By default, the asynchronous algorithms use the new caching allocators. Deallocation of temporary storage is deferred until the destruction of the returned thrust::future. The content of thrust::futures is stored in either device or universal memory and transferred to the host only upon request to prevent unnecessary data migration.
- Asynchronous algorithms are currently only implemented for the CUDA system and are C++11 only.
exec.after(f, g, ...), a new execution policy method that takes a set of thrust::futures and returns an execution policy; operations launched with the returned policy depend upon those futures.
🆕 New logic and mindset for the type requirements for cross-system sequence copies (currently only used by thrust::async::copy), based on:
THRUST_PROCLAIM_CONTIGUOUS_ITERATOR for detecting/indicating that an iterator points to contiguous storage.
THRUST_PROCLAIM_TRIVIALLY_RELOCATABLE for detecting/indicating that a type is memcpyable (based on principles from https://wg21.link/P1144).
- The new approach reduces buffering, increases performance, and increases correctness.
- The fast path is now enabled when copying fp16 and CUDA vector types with
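The dispatch idea behind this list can be sketched in plain C++ without Thrust. The trait `is_memcpyable`, the enum `copy_path`, and the function `copy_n_sketch` are hypothetical names invented for this illustration, and `std::is_trivially_copyable` is used as a conservative stand-in for the relocatability test. Thrust's own traits and copy machinery may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical trait: can objects of type T be copied with memcpy?
// Assumption: std::is_trivially_copyable is a safe, conservative default.
template <typename T>
struct is_memcpyable : std::is_trivially_copyable<T> {};

// Reports which path a copy took, so the choice is observable.
enum class copy_path { bulk_memcpy, element_wise };

// Fast path: contiguous storage of memcpy-able elements, one bulk copy.
template <typename T>
copy_path copy_n_impl(const T* first, std::size_t n, T* out, std::true_type) {
  if (n != 0) std::memcpy(out, first, n * sizeof(T));
  return copy_path::bulk_memcpy;
}

// Slow path: copy-assign each element individually.
template <typename T>
copy_path copy_n_impl(const T* first, std::size_t n, T* out, std::false_type) {
  for (std::size_t i = 0; i < n; ++i) out[i] = first[i];
  return copy_path::element_wise;
}

// Raw pointers stand in for "an iterator that points to contiguous storage".
template <typename T>
copy_path copy_n_sketch(const T* first, std::size_t n, T* out) {
  return copy_n_impl(first, n, out, typename is_memcpyable<T>::type{});
}
```

With this sketch, an array of `int` takes the bulk path while an array of `std::string` falls back to element-wise assignment. A type could opt in to the bulk path by specializing the hypothetical trait.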
🐎 All Thrust synchronous algorithms for the CUDA backend now actually synchronize. Previously, any algorithm that did not allocate temporary storage (counterexample: thrust::sort) and did not have a computation-dependent result (counterexample: thrust::reduce) would actually be launched asynchronously. Additionally, synchronous algorithms that allocated temporary storage would become asynchronous if a custom allocator was supplied that did not synchronize on allocation/deallocation, unlike cudaFree. So, now thrust::sort, etc. are truly synchronous. In some cases this may be a performance regression; if you need asynchrony, use the new asynchronous algorithms.
👍 Thrust's allocator framework has been rewritten. It now uses a memory resource system, similar to C++17's std::pmr but supporting static polymorphism. In this new model, memory resources are objects that allocate untyped storage, and allocators are cheap handles to memory resources. The new facilities live in
thrust::mr::memory_resource<Pointer>, the memory resource base class, which takes a (possibly tagged) pointer to void type as a parameter.
thrust::mr::allocator<T, MemoryResource>, an allocator backed by a memory resource object.
thrust::mr::polymorphic_adaptor_resource<Pointer>, a type-erased memory resource adaptor.
thrust::mr::polymorphic_allocator<T>, a C++17-style polymorphic allocator backed by a type-erased memory resource object.
- New tunable C++17-style caching memory resources, thrust::mr::(disjoint_)?(un)?synchronized_pool_resource, designed to cache both small object allocations and large repetitive temporary allocations. The disjoint variants use separate storage for management of the pool, which is necessary if the memory being allocated cannot be accessed on the host (e.g. device memory).
- System-specific allocators were rewritten to use the new memory resource framework.
thrust::device_memory_resource for allocating device memory.
thrust::universal_memory_resource for allocating memory that can be accessed from both the host and device (e.g.
thrust::universal_host_pinned_memory_resource for allocating memory that can be accessed from the host and the device but always resides in host memory (e.g.
thrust::per_device_allocator, which lazily create and retrieve a per-device singleton memory resource.
- Rebinding mechanisms (
thrust::device_make_unique, a factory function for creating a std::unique_ptr to a newly allocated object in device memory.
<thrust/detail/memory_algorithms>, a C++11 implementation of the C++17 uninitialized memory algorithms.
thrust::allocate_unique and friends, based on the proposed C++23
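The split between memory resources (objects that allocate untyped storage) and allocators (cheap handles to a resource) can be illustrated with standard C++ alone. The following is a minimal sketch under stated assumptions: `counting_resource`, `resource_allocator`, and `demo_balanced_allocations` are hypothetical names made up for this example and are not the thrust::mr classes. The resource type is a template parameter, so calls are resolved statically rather than through virtual dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical memory resource: hands out untyped storage and counts calls.
class counting_resource {
public:
  void* allocate(std::size_t bytes) {
    ++allocations_;
    return ::operator new(bytes);
  }
  void deallocate(void* p, std::size_t /*bytes*/) noexcept {
    ++deallocations_;
    ::operator delete(p);
  }
  std::size_t allocations() const { return allocations_; }
  std::size_t deallocations() const { return deallocations_; }

private:
  std::size_t allocations_ = 0;
  std::size_t deallocations_ = 0;
};

// Hypothetical allocator: a copyable handle holding one pointer to a resource.
template <typename T, typename Resource>
class resource_allocator {
public:
  using value_type = T;

  explicit resource_allocator(Resource* r) noexcept : resource_(r) {}

  // Rebinding copy: allocators for other element types share the resource.
  template <typename U>
  resource_allocator(const resource_allocator<U, Resource>& other) noexcept
      : resource_(other.resource()) {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(resource_->allocate(n * sizeof(T)));
  }
  void deallocate(T* p, std::size_t n) noexcept {
    resource_->deallocate(p, n * sizeof(T));
  }
  Resource* resource() const noexcept { return resource_; }

private:
  Resource* resource_;
};

// Two handles compare equal when they refer to the same resource.
template <typename T, typename U, typename R>
bool operator==(const resource_allocator<T, R>& a,
                const resource_allocator<U, R>& b) noexcept {
  return a.resource() == b.resource();
}
template <typename T, typename U, typename R>
bool operator!=(const resource_allocator<T, R>& a,
                const resource_allocator<U, R>& b) noexcept {
  return !(a == b);
}

// Uses the handle with a standard container and checks that every
// allocation made through the resource was returned to it.
bool demo_balanced_allocations() {
  counting_resource res;
  {
    resource_allocator<int, counting_resource> alloc(&res);
    std::vector<int, resource_allocator<int, counting_resource>> v(alloc);
    v.reserve(8);
    v.push_back(42);
    assert(v.front() == 42);
  }
  return res.allocations() >= 1 && res.allocations() == res.deallocations();
}
```

Because the allocator stores only a pointer, copying it is cheap, and several containers can share one resource. A caching resource would keep freed blocks in `deallocate` for reuse instead of releasing them immediately.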
New type traits and metaprogramming facilities. Type traits are slowly being migrated out of
<thrust/detail/*>; their new home will be
thrust::is_operator_less_or_greater_function_object, which detects
, which detects thrust::plus
thrust::remove_cvref(_t)?, a C++11 implementation of C++20's
thrust::void_t, and various other new type traits.
thrust::integer_sequence and friends, a C++11 implementation of C++14's
thrust::disjunction, a C++11 implementation of C++17's logical metafunctions.
- Some Thrust type traits (such as thrust::is_constructible) have been redefined in terms of C++11's type traits when they are available.
thrust::optional, a C++11 implementation of C++17's
thrust::addressof, an implementation of C++11's
thrust::prev, an implementation of C++11's
<functional> style unary function object that multiplies its argument by itself.
thrust::numeric_limits, a customized version of
<thrust/detail/preprocessor.h>, new general purpose preprocessor facilities:
THRUST_PP_CAT[2-5], concatenates two to five tokens.
THRUST_PP_EXPAND(_ARGS)?, performs double expansion.
THRUST_PP_DISPATCH, tools for macro overloading.
THRUST_PP_BOOL, boolean conversion.
THRUST_PP_HEAD, a variadic macro that expands to the first argument.
THRUST_PP_TAIL, a variadic macro that expands to all its arguments after the first.
THRUST_PP_IIF, bitwise conditional.
THRUST_PP_HAS_COMMA, facilities for adding and detecting comma tokens.
THRUST_PP_IS_VARIADIC_NULLARY, returns true if called with a nullary
THRUST_CURRENT_FUNCTION, expands to the name of the current function.
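Token concatenation with double expansion, plus head/tail selection from a variadic argument list, can be sketched with a few standalone macros. The `SKETCH_*` names below are hypothetical and are not the definitions from <thrust/detail/preprocessor.h>. They only illustrate why an extra level of macro indirection is needed before `##` is applied.

```cpp
#include <cassert>

// Hypothetical macros for illustration only.
#define SKETCH_CAT2_IMPL(a, b) a##b
// The extra level lets the arguments be macro-expanded before pasting.
#define SKETCH_CAT2(a, b) SKETCH_CAT2_IMPL(a, b)

// Expands to the first argument of a variadic list.
#define SKETCH_HEAD(x, ...) x
// Expands to every argument after the first.
#define SKETCH_TAIL(x, ...) __VA_ARGS__

#define SKETCH_VERSION 194

// With double expansion, SKETCH_VERSION becomes 194 first: declares value_194.
int SKETCH_CAT2(value_, SKETCH_VERSION) = 7;

// Pasting directly skips the expansion: declares value_SKETCH_VERSION.
int SKETCH_CAT2_IMPL(value_, SKETCH_VERSION) = 9;

// Head and tail of the list (10, 20, 30).
constexpr int head_value = SKETCH_HEAD(10, 20, 30);
constexpr int tail_values[] = {SKETCH_TAIL(10, 20, 30)};

static_assert(head_value == 10, "SKETCH_HEAD selects the first argument");
static_assert(sizeof(tail_values) / sizeof(tail_values[0]) == 2,
              "SKETCH_TAIL keeps the remaining two arguments");
```

The two declarations show the difference in practice: the first produces a variable named `value_194`, the second one named `value_SKETCH_VERSION`.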
🆕 New C++11 compatibility macros:
THRUST_NODISCARD, expands to [[nodiscard]] when available and the best equivalent otherwise.
THRUST_CONSTEXPR, expands to constexpr when available and the best equivalent otherwise.
THRUST_OVERRIDE, expands to override when available and the best equivalent otherwise.
THRUST_DEFAULT, expands to = default; when available and the best equivalent otherwise.
THRUST_NOEXCEPT, expands to noexcept when available and the best equivalent otherwise.
THRUST_FINAL, expands to final when available and the best equivalent otherwise.
THRUST_INLINE_CONSTANT, expands to inline constexpr when available and the best equivalent otherwise.
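A compatibility macro of this kind usually tests the language version and falls back to the closest available spelling. The sketch below shows the pattern with two hypothetical macros, `SKETCH_NODISCARD` and `SKETCH_INLINE_CONSTANT`. The version checks and fallbacks are assumptions chosen for this example, so Thrust's actual definitions may use different conditions.

```cpp
#include <cassert>

// Hypothetical macro: [[nodiscard]] in C++17 and later, a compiler-specific
// attribute on GCC/Clang before that, and nothing elsewhere.
#if __cplusplus >= 201703L
#  define SKETCH_NODISCARD [[nodiscard]]
#elif defined(__GNUC__)
#  define SKETCH_NODISCARD __attribute__((warn_unused_result))
#else
#  define SKETCH_NODISCARD
#endif

// Hypothetical macro: inline variables need C++17; before that,
// static constexpr gives each translation unit its own copy.
#if __cplusplus >= 201703L
#  define SKETCH_INLINE_CONSTANT inline constexpr
#else
#  define SKETCH_INLINE_CONSTANT static constexpr
#endif

// A constant that can be defined in a header under either expansion.
SKETCH_INLINE_CONSTANT int answer = 42;

// Callers that ignore the result get a diagnostic where supported.
SKETCH_NODISCARD inline int checked_answer() { return answer; }
```

Code written against the macros compiles unchanged from C++11 onward and picks up the stronger guarantee automatically on newer compilers.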
<thrust/detail/type_deduction.h>, new C++11-only type deduction helpers:
THRUST_DECLTYPE_RETURNS*, expand to function definitions with suitable conditional noexcept qualifiers and trailing return types.
THRUST_FWD(x), expands to
THRUST_MVCAP, expands to a lambda move capture.
THRUST_RETOF, expands to a decltype computing the return type of an invocable.
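The idea of a macro that supplies the conditional noexcept qualifier, the trailing return type, and the function body from a single expression can be sketched as follows. `SKETCH_DECLTYPE_RETURNS` and `sketch_add` are hypothetical names for this illustration. The macros in <thrust/detail/type_deduction.h> may be defined differently.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical macro: writes the expression three times so the function is
// noexcept exactly when the expression is, returns the expression's type,
// and returns the expression's value.
#define SKETCH_DECLTYPE_RETURNS(...)    \
  noexcept(noexcept(__VA_ARGS__))       \
      -> decltype(__VA_ARGS__) {        \
    return __VA_ARGS__;                 \
  }

// A generic addition whose signature is deduced entirely from its body.
template <typename T, typename U>
auto sketch_add(T&& t, U&& u)
    SKETCH_DECLTYPE_RETURNS(std::forward<T>(t) + std::forward<U>(u))

// Built-in integer addition cannot throw, so the function is noexcept.
static_assert(noexcept(sketch_add(1, 2)), "int + int is noexcept");

// The return type follows the usual arithmetic conversions.
static_assert(std::is_same<decltype(sketch_add(1, 2.5)), double>::value,
              "int + double yields double");
```

Without such a macro, the same expression has to be repeated by hand in the noexcept specifier, the trailing return type, and the return statement, which is easy to get out of sync.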
🆕 New Examples
mr_basic demonstrates how to use the new memory resource allocator system.
- 🏷 Tagged pointer enhancements:
nullptr support for Thrust tagged pointers.
explicit operator bool for Thrust tagged pointers when using C++11 for
thrust::static_pointer_cast for casting Thrust tagged pointers.
- Iterator enhancements:
thrust::iterator_system is now SFINAE friendly.
- Removed cv qualifiers from iterator types when using
- Static assert enhancements:
THRUST_STATIC_ASSERT_MSG, takes an optional string constant to be used as the error message when possible.
THRUST_STATIC_ASSERT(_MSG) now uses C++11's static_assert when it's available.
- Introduce a way to test for static assertions.
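A static assertion macro that prefers C++11's static_assert and falls back to an older idiom otherwise might look like the sketch below. The `SKETCH_*` macros, the negative-array-size fallback, and the function `sketch_checked_bits` are assumptions made for illustration. They are not Thrust's implementation.

```cpp
#include <cassert>
#include <climits>

#if __cplusplus >= 201103L
// C++11 and later: use the language feature, which prints the message.
#  define SKETCH_STATIC_ASSERT_MSG(cond, msg) static_assert(cond, msg)
#else
// Older compilers: an array typedef of size -1 is ill-formed, so a false
// condition fails to compile. The message cannot be shown in this case.
#  define SKETCH_SA_CAT_IMPL(a, b) a##b
#  define SKETCH_SA_CAT(a, b) SKETCH_SA_CAT_IMPL(a, b)
#  define SKETCH_STATIC_ASSERT_MSG(cond, msg) \
     typedef char SKETCH_SA_CAT(sketch_static_assertion_, __LINE__)[(cond) ? 1 : -1]
#endif

// Message-free form: reuse the condition's own text as the message.
#define SKETCH_STATIC_ASSERT(cond) SKETCH_STATIC_ASSERT_MSG(cond, #cond)

// Both forms can be used at namespace scope...
SKETCH_STATIC_ASSERT_MSG(sizeof(int) >= 2, "int must hold at least 16 bits");

// ...and inside function bodies.
int sketch_checked_bits() {
  SKETCH_STATIC_ASSERT(CHAR_BIT >= 8);
  return CHAR_BIT;
}
```

Conditions containing top-level commas would need extra parentheses with this simple two-parameter form.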
- ✅ Testing enhancements:
- Additional scalar and sequence types, including non-builtin types and vectors with unified memory allocators, have been added to the list of types used by generic unit tests.
- The generation of random input data has been improved to increase the range of values used and catch more corner cases.
truncate_to_max_representable utility for avoiding the generation of ranges that cannot be represented by the underlying element type in generic unit test code.
- The test driver now synchronizes with CUDA devices and checks for errors after each test, when switching devices, and after each raw kernel launch.
- The warningtester uber header is now compiled with NVCC to avoid needing to disable CUDA-specific code with the preprocessor.
- Fixed the unit test framework's
DECLARE_INTEGRAL_VARIABLE_UNITTEST test declaration macro.
DECLARE_VARIABLE_UNITTEST_WITH_TYPES_AND_NAME test declaration macro.
thrust::system_error in the CUDA backend now prints out its cudaError_t enumerator in addition to the diagnostic message.
- Stopped using conditionally signed types like
🐛 Bug Fixes
- #897, 2062242 Fix compilation error when using
- #908, 2089386 Static assert that thrust::fill doesn't operate on const iterators.
- #919 Fix compilation failure with
- ⚙ #924, 2096679, 2315990 Fix dispatch for the CUDA backend's thrust::reduce to use two functions (one with the pragma for disabling exec checks, one with THRUST_RUNTIME_FUNCTION) instead of one. This fixes a regression with device compilation that started in CUDA 9.2.
- #928, 2341455 Add missing __host__ __device__ annotations to a thrust::complex::operator= to satisfy GoUDA.
- 0️⃣ 2094642 Make thrust::vector_base::clear not depend on the element type being default constructible.
- 2289115 Remove flaky
- 2328572 Add missing thrust::device_vector constructor that takes an allocator parameter.
- ⚡️ 2455740 Update the range_view example to not use device-side launch.
- ✅ 2455943 Ensure that sized unit tests that use counting_iterator perform proper truncation.
- 🔨 2455952 Refactor questionable