PyTorch v0.4.0 Release Notes
Release Date: 2018-04-24
Table of Contents

- Major Core Changes
  - Tensor / Variable merged
  - Zero-dimensional Tensors
  - dtypes
  - Migration guide
- New Features
  - Tensors
    - Full support for advanced indexing
    - Fast Fourier Transforms
  - Neural Networks
    - Trade-off memory for compute
    - bottleneck - a tool to identify hotspots in your code
  - torch.distributions
    - 24 basic probability distributions
    - Added cdf, variance, entropy, perplexity etc.
  - Distributed Training
    - Launcher utility for ease of use
    - NCCL2 backend
  - C++ Extensions
  - Windows Support
  - ONNX Improvements
    - RNN support
- Performance Improvements
- Bug Fixes
Major Core changes

Here is a summary of the updates to the most important core features users will use daily.

Major Changes and Potentially Breaking Changes:

- `Tensors` and `Variables` have merged
- Some operations now return 0-dimensional (scalar) `Tensors`
- Deprecation of the `volatile` flag

Improvements:

- `dtypes`, `devices`, and NumPy-style `Tensor` creation functions added
- Support for writing device-agnostic code

We wrote a migration guide that should help you transition your code to the new APIs and style. Please read it if you have code in a previous version of PyTorch that you would like to migrate. The contents of this section (Major Core changes) are included in the migration guide.
Merging Tensor and Variable classes

`torch.autograd.Variable` and `torch.Tensor` are now the same class. More precisely, `torch.Tensor` is capable of tracking history and behaves like the old `Variable`; `Variable` wrapping continues to work as before but returns an object of type `torch.Tensor`. This means that you don't need the `Variable` wrapper everywhere in your code anymore.

The type() of a Tensor has changed

Note also that the `type()` of a Tensor no longer reflects the data type. Use `isinstance()` or `x.type()` instead:

```python
>>> x = torch.DoubleTensor([1, 1, 1])
>>> print(type(x))  # was torch.DoubleTensor
<class 'torch.autograd.variable.Variable'>
>>> print(x.type())  # OK: 'torch.DoubleTensor'
'torch.DoubleTensor'
>>> print(isinstance(x, torch.DoubleTensor))  # OK: True
True
```
When does autograd start tracking history now?

`requires_grad`, the central flag for `autograd`, is now an attribute on `Tensor`s. Let's see how this change manifests in code.

`autograd` uses the same rules previously used for `Variable`s. It starts tracking history when any input `Tensor` of an operation has `requires_grad=True`. For example,

```python
>>> x = torch.ones(1)  # create a tensor with requires_grad=False (default)
>>> x.requires_grad
False
>>> y = torch.ones(1)  # another tensor with requires_grad=False
>>> z = x + y
>>> # both inputs have requires_grad=False, so does the output
>>> z.requires_grad
False
>>> # then autograd won't track this computation. let's verify!
>>> z.backward()
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
>>>
>>> # now create a tensor with requires_grad=True
>>> w = torch.ones(1, requires_grad=True)
>>> w.requires_grad
True
>>> # add to the previous result that has require_grad=False
>>> total = w + z
>>> # the total sum now requires grad!
>>> total.requires_grad
True
>>> # autograd can compute the gradients as well
>>> total.backward()
>>> w.grad
tensor([1.])
>>> # and no computation is wasted to compute gradients for x, y and z, which don't require grad
>>> z.grad == x.grad == y.grad == None
True
```
Manipulating the requires_grad flag

Other than directly setting the attribute, you can change this flag in-place using `my_tensor.requires_grad_(requires_grad=True)`, or, as in the above example, at creation time by passing it in as an argument (the default is `False`), e.g.,

```python
>>> existing_tensor.requires_grad_()
>>> existing_tensor.requires_grad
True
>>> my_tensor = torch.zeros(3, 4, requires_grad=True)
>>> my_tensor.requires_grad
True
```
What about .data?

`.data` was the primary way to get the underlying `Tensor` from a `Variable`. After this merge, calling `y = x.data` still has similar semantics. So `y` will be a `Tensor` that shares the same data with `x`, is unrelated to the computation history of `x`, and has `requires_grad=False`.

However, `.data` can be unsafe in some cases. Any changes on `x.data` wouldn't be tracked by `autograd`, and the computed gradients would be incorrect if `x` is needed in a backward pass. A safer alternative is to use `x.detach()`, which also returns a `Tensor` that shares data with `x` and has `requires_grad=False`, but will have its in-place changes reported by `autograd` if `x` is needed in backward.

Some operations now return 0-dimensional (scalar) Tensors
Previously, indexing into a `Tensor` vector (1-dimensional tensor) gave a Python number, but indexing into a `Variable` vector gave (inconsistently!) a vector of size `(1,)`! Similar behavior existed with reduction functions, i.e. `tensor.sum()` would return a Python number, but `variable.sum()` would return a vector of size `(1,)`.

Fortunately, this release introduces proper scalar (0-dimensional tensor) support in PyTorch! Scalars can be created using the new `torch.tensor` function (which will be explained in more detail later; for now just think of it as the PyTorch equivalent of `numpy.array`). Now you can do things like:

```python
>>> torch.tensor(3.1416)         # create a scalar directly
tensor(3.1416)
>>> torch.tensor(3.1416).size()  # scalar is 0-dimensional
torch.Size([])
>>> torch.tensor([3]).size()     # compare to a vector of size 1
torch.Size([1])
>>>
>>> vector = torch.arange(2, 6)  # this is a vector
>>> vector
tensor([2., 3., 4., 5.])
>>> vector.size()
torch.Size([4])
>>> vector[3]                    # indexing into a vector gives a scalar
tensor(5.)
>>> vector[3].item()             # .item() gives the value as a Python number
5.0
>>> sum = torch.tensor([2, 3]).sum()
>>> sum
tensor(5)
>>> sum.size()
torch.Size([])
```
Accumulating losses

Consider the widely used pattern `total_loss += loss.data[0]` before 0.4.0. Here `loss` was a `Variable` wrapping a tensor of size `(1,)`, but in 0.4.0 `loss` is now a scalar with `0` dimensions. Indexing into a scalar doesn't make sense (it gives a warning now, but will be a hard error in 0.5.0): use `loss.item()` to get the Python number from a scalar.

Note that if you don't convert to a Python number when accumulating losses, you may find increased memory usage in your program. This is because the right-hand side of the above expression used to be a Python float, while it is now a zero-dim Tensor. The total loss is thus accumulating Tensors and their gradient history, which may keep around large autograd graphs for much longer than necessary.
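A minimal sketch of the updated accumulation pattern (the model and loss function here are illustrative, not from the release):

```python
import torch

# toy model and loss; names are illustrative
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()

total_loss = 0.0
for _ in range(3):
    x = torch.randn(2, 4)
    y = torch.randn(2, 1)
    loss = loss_fn(model(x), y)  # a zero-dim Tensor as of 0.4.0
    loss.backward()
    total_loss += loss.item()    # detach to a Python float; no graph kept alive

print(total_loss)
```

Using `loss.item()` here keeps `total_loss` a plain Python float, so no autograd graph survives the loop iteration.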
Deprecation of the volatile flag

The `volatile` flag is now deprecated and has no effect. Previously, any computation that involved a `Variable` with `volatile=True` wouldn't be tracked by `autograd`. This has now been replaced by a set of more flexible context managers including `torch.no_grad()`, `torch.set_grad_enabled(grad_mode)`, and others.

```python
>>> x = torch.zeros(1, requires_grad=True)
>>> with torch.no_grad():
...     y = x * 2
>>> y.requires_grad
False
>>>
>>> is_train = False
>>> with torch.set_grad_enabled(is_train):
...     y = x * 2
>>> y.requires_grad
False
>>> torch.set_grad_enabled(True)  # this can also be used as a function
>>> y = x * 2
>>> y.requires_grad
True
>>> torch.set_grad_enabled(False)
>>> y = x * 2
>>> y.requires_grad
False
```
dtypes, devices and NumPy-style creation functions

In previous versions of PyTorch, we used to specify data type (e.g. float vs double), device type (cpu vs cuda) and layout (dense vs sparse) together as a "tensor type". For example, `torch.cuda.sparse.DoubleTensor` was the `Tensor` type representing `double` data type, living on CUDA devices, and with COO sparse tensor layout.

In this release, we introduce `torch.dtype`, `torch.device` and `torch.layout` classes to allow better management of these properties via NumPy-style creation functions.
torch.dtype

Below is a complete list of available `torch.dtype`s (data types) and their corresponding tensor types.

| Data type | torch.dtype | Tensor types |
|---|---|---|
| 32-bit floating point | `torch.float32` or `torch.float` | `torch.*.FloatTensor` |
| 64-bit floating point | `torch.float64` or `torch.double` | `torch.*.DoubleTensor` |
| 16-bit floating point | `torch.float16` or `torch.half` | `torch.*.HalfTensor` |
| 8-bit integer (unsigned) | `torch.uint8` | `torch.*.ByteTensor` |
| 8-bit integer (signed) | `torch.int8` | `torch.*.CharTensor` |
| 16-bit integer (signed) | `torch.int16` or `torch.short` | `torch.*.ShortTensor` |
| 32-bit integer (signed) | `torch.int32` or `torch.int` | `torch.*.IntTensor` |
| 64-bit integer (signed) | `torch.int64` or `torch.long` | `torch.*.LongTensor` |
Use `torch.set_default_dtype` and `torch.get_default_dtype` to manipulate the default `dtype` for floating point tensors.
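For example, the default floating point dtype can be switched globally; a quick sketch:

```python
import torch

# out of the box, new float tensors default to float32
assert torch.get_default_dtype() == torch.float32
x = torch.zeros(2)
assert x.dtype == torch.float32

# switch the default to double precision
torch.set_default_dtype(torch.float64)
y = torch.zeros(2)
assert y.dtype == torch.float64

# restore the default
torch.set_default_dtype(torch.float32)
```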
torch.device

A `torch.device` contains a device type (`'cpu'` or `'cuda'`) and an optional device ordinal (id) for the device type. It can be initialized with `torch.device('{device_type}')` or `torch.device('{device_type}:{device_ordinal}')`.

If the device ordinal is not present, this represents the current device for the device type; e.g., `torch.device('cuda')` is equivalent to `torch.device('cuda:X')` where `X` is the result of `torch.cuda.current_device()`.
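A short sketch of constructing and using `torch.device` objects (the CUDA branch is guarded so the snippet also runs on CPU-only machines):

```python
import torch

cpu = torch.device('cpu')
print(cpu.type)                 # 'cpu'

# device objects can be constructed even without a GPU present
gpu0 = torch.device('cuda:0')
print(gpu0.type, gpu0.index)    # 'cuda' 0

# tensors can be created directly on a device; fall back to CPU here
device = torch.device('cuda') if torch.cuda.is_available() else cpu
t = torch.zeros(3, device=device)
print(t.device)
```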
torch.layout

`torch.layout` represents the data layout of a `Tensor`. Currently `torch.strided` (dense tensors) and `torch.sparse_coo` (sparse tensors with COO format) are supported.

Creating Tensors

Methods that create a `Tensor` now also take in `dtype`, `device`, `layout`, and `requires_grad` options to specify the desired attributes on the returned `Tensor`. For example,

```python
>>> device = torch.device("cuda:1")
>>> x = torch.randn(3, 3, dtype=torch.float64, device=device)
tensor([[-0.6344,  0.8562, -1.2758],
        [ 0.8414,  1.7962,  1.0589],
        [-0.1369, -1.0462, -0.4373]], dtype=torch.float64, device='cuda:1')
>>> x.requires_grad  # default is False
False
>>> x = torch.zeros(3, requires_grad=True)
>>> x.requires_grad
True
```
torch.tensor

`torch.tensor` is one of the newly added tensor creation methods. It takes in array-like data of all kinds and copies the contained values into a new `Tensor`. As mentioned earlier, `torch.tensor` is the PyTorch equivalent of NumPy's `numpy.array` constructor. Unlike the `torch.*Tensor` methods, you can also create zero-dimensional `Tensor`s (aka scalars) this way (a single Python number is treated as a size in the `torch.*Tensor` methods). Moreover, if a `dtype` argument isn't given, it will infer the suitable `dtype` given the data. It is the recommended way to create a tensor from existing data like a Python list. For example,

```python
>>> cuda = torch.device("cuda")
>>> torch.tensor([[1], [2], [3]], dtype=torch.half, device=cuda)
tensor([[1],
        [2],
        [3]], device='cuda:0')
>>> torch.tensor(1)               # scalar
tensor(1)
>>> torch.tensor([1, 2.3]).dtype  # type inference
torch.float32
>>> torch.tensor([1, 2]).dtype    # type inference
torch.int64
```
We've also added more tensor creation methods. Some of them have `torch.*_like` and/or `tensor.new_*` variants.

`torch.*_like` takes in an input `Tensor` instead of a shape. It returns a `Tensor` with the same attributes as the input `Tensor` by default unless otherwise specified:

```python
>>> x = torch.randn(3, dtype=torch.float64)
>>> torch.zeros_like(x)
tensor([0., 0., 0.], dtype=torch.float64)
>>> torch.zeros_like(x, dtype=torch.int)
tensor([0, 0, 0], dtype=torch.int32)
```

`tensor.new_*` can also create `Tensor`s with the same attributes as `tensor`, but it always takes in a shape argument:

```python
>>> x = torch.randn(3, dtype=torch.float64)
>>> x.new_ones(2)
tensor([1., 1.], dtype=torch.float64)
>>> x.new_ones(4, dtype=torch.int)
tensor([1, 1, 1, 1], dtype=torch.int32)
```
To specify the desired shape, you can either use a tuple (e.g., `torch.zeros((2, 3))`) or variable arguments (e.g., `torch.zeros(2, 3)`) in most cases.

| Name | Returned Tensor | torch.*_like variant | tensor.new_* variant |
|---|---|---|---|
| `torch.empty` | uninitialized memory | ✔ | ✔ |
| `torch.zeros` | all zeros | ✔ | ✔ |
| `torch.ones` | all ones | ✔ | ✔ |
| `torch.full` | filled with a given value | ✔ | ✔ |
| `torch.rand` | i.i.d. continuous Uniform[0, 1) | ✔ | |
| `torch.randn` | i.i.d. Normal(0, 1) | ✔ | |
| `torch.randint` | i.i.d. discrete Uniform in given range | ✔ | |
| `torch.randperm` | random permutation of {0, 1, ..., n - 1} | | |
| `torch.tensor` | copied from existing data (`list`, NumPy `ndarray`, etc.) | | ✔ |
| `torch.from_numpy`* | from NumPy `ndarray` (sharing storage without copying) | | |
| `torch.arange`, `torch.range`, and `torch.linspace` | uniformly spaced values in a given range | | |
| `torch.logspace` | logarithmically spaced values in a given range | | |
| `torch.eye` | identity matrix | | |

*: `torch.from_numpy` only takes in a NumPy `ndarray` as its input argument.

Writing device-agnostic code
Previous versions of PyTorch made it difficult to write code that was device-agnostic (i.e. that could run on both CUDA-enabled and CPU-only machines without modification).

PyTorch 0.4.0 makes this easier in two ways:

- The `device` attribute of a Tensor gives the `torch.device` for all Tensors (`get_device` only works for CUDA tensors).
- The `to` method of `Tensors` and `Modules` can be used to easily move objects to different devices (instead of having to call `cpu()` or `cuda()` based on the context).

We recommend the following pattern:

```python
# at the beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

...

# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)
```
Tensors

Full support for advanced indexing

PyTorch now has full support for advanced indexing, following NumPy's advanced indexing rules. The following examples are now possible:

```python
a = torch.rand(10, 10, 10, 10)

# the indexing elements can have other shapes than 1
b = a[[[3, 2]], :, [[1, 3]]]

# broadcasting also supported in the indices, as well as lists,
# negative indices, slices, ellipses, numbers
c = a[[1, -2], 2:4, :, [1]]

# can also support tensors as indices
index = torch.tensor([2, 4])
d = a[index]

# and the indices can be on the GPU or CPU
e = a[index.cuda()]
f = a.cuda()[index]

mask = torch.rand(10) > 0.5

# we can now index with a mask that has fewer
# dimensions than the indexing tensor
c = a[mask, :5]
```
Fast Fourier Transform

- Add new FFT methods #5856
- Add `torch.stft` (short time Fourier transform) and hann/hamming/bartlett window functions #4095
- Support arbitrary number of batch dimensions in *FFT #6528
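A sketch of `torch.stft` with a Hann window. Note one assumption: the `return_complex` argument did not exist in 0.4.0; it is required on recent PyTorch builds, which is what this snippet targets:

```python
import torch

signal = torch.randn(32)          # a short 1-D signal
window = torch.hann_window(8)

# short-time Fourier transform with an 8-point window and a hop of 2 samples
# (return_complex=True is required on current PyTorch, not in 0.4.0)
spec = torch.stft(signal, n_fft=8, hop_length=2, window=window,
                  return_complex=True)

# rows are n_fft // 2 + 1 frequency bins, columns are time frames
print(spec.shape)
```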
New and updated Torch operators

- Added `torch.log2` and `torch.log10` #6272
- Added `torch.isnan` #5273
- Add `torch.reshape`, which is similar to `numpy.reshape`. It is roughly equivalent to `tensor.contiguous().view()`, but avoids copying in certain cases #5575
- Add CPU implementation of `torch.unique`, which outputs the unique elements of a Tensor #5503
- Add `torch.det`, `torch.logdet` and `torch.slogdet`, for computing the (log-)determinant of square 2D tensors. For negative determinants, `torch.logdet` returns `nan`, while `torch.slogdet` returns the sign of the log-determinant and the log of the absolute value of the determinant. #3816 and #5393
- Add `nn.functional.gumbel_softmax`, which lets you use the reparametrization trick for discrete variables #3341
- Add `torch.take` and `Tensor.put_`. Those functions are equivalent to `numpy.take` and `numpy.put`, and are the base for full support of advanced indexing in PyTorch #3263
- Add `torch.randint`, similar to `numpy.random.randint` #6136
- Add `torch.diagonal` and `torch.diagflat`, similar to `numpy.diagonal` and `numpy.diagflat`. They are meant as a replacement for `torch.diag`, which handled both the cases of constructing a diagonal tensor as well as extracting the diagonal of a matrix #5622
- Add `torch.einsum`, equivalent to `numpy.einsum`. einsum allows you to perform operations using Einstein's notation. #5503

  ```python
  a = torch.arange(0, 9).reshape(3, 3)
  # the following transposes a
  b = torch.einsum('ij->ji', (a,))
  ```

- Add `torch.expm1`, a numerically stable `exp(x)-1` for small `x`. #4350
- Allow users to specify individual split sizes with `torch.split` #3837
- Add `torch.where(condition, tensor1, tensor2)` that returns a tensor of elements selected from `tensor1` or `tensor2` based on `condition`. #4259
- Add `Tensor.norm(dim)` for sparse tensors. #4882
- Implement `torch.neg` for all types. #4075
- Implement gradient calculation for `torch.trtrs`. #3972
- Deprecate out-of-place `Tensor.resize` and `Tensor.resize_as`. These have weird semantics and are hard to use correctly. Please use their in-place variants `Tensor.resize_` and `Tensor.resize_as_`. #4886
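A quick illustration of `torch.where` (one assumption: this sketch uses a bool condition tensor as on current PyTorch; in 0.4.0 the condition was a ByteTensor mask):

```python
import torch

cond = torch.tensor([True, False, True])
a = torch.tensor([1., 2., 3.])
b = torch.tensor([10., 20., 30.])

# elementwise select: take from a where cond is True, from b elsewhere
out = torch.where(cond, a, b)
print(out)  # tensor([ 1., 20.,  3.])
```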
Rename async argument in .cuda() to non_blocking

The `async` keyword argument in conversion calls is now deprecated in PyTorch, and it has been replaced by `non_blocking`. This was necessary because `async` will be a keyword in Python 3.7.

Neural Networks
A new autograd container that lets you trade compute for memory

The new `checkpoint` container allows you to only store a subset of the outputs necessary for backpropagation. If an output is missing (to save memory), the `checkpoint` container will recompute the intermediate outputs from the closest checkpoint, so that memory usage can be reduced (with an increase in computation time).

Here is an example:

```python
# input
input = torch.rand(1, 10)

# suppose we have a very deep model
layers = [nn.Linear(10, 10) for _ in range(1000)]
model = nn.Sequential(*layers)
output = model(input)
```

The above model uses a lot of memory, because it needs to keep the intermediate values of every operation for backpropagation. `checkpoint` lets you reduce the memory requirements:

```python
# create the input tensors and set requires_grad=True
# NOTE: requires_grad=True for the input is a current
# limitation of checkpointing. At least one of the
# model inputs should have requires_grad=True.
# If you don't do it, you might have empty gradients.
input = torch.rand(1, 10, requires_grad=True)
layers = [nn.Linear(10, 10) for _ in range(1000)]

# define functions that decide where we will checkpoint and
# store intermediate gradients. In this case, we will only
# store one intermediate gradient, in the middle of the model
def run_first_half(*args):
    x = args[0]
    for layer in layers[:500]:
        x = layer(x)
    return x

def run_second_half(*args):
    x = args[0]
    for layer in layers[500:-1]:
        x = layer(x)
    return x

# now use the new checkpoint functionality
from torch.utils.checkpoint import checkpoint
x = checkpoint(run_first_half, input)
x = checkpoint(run_second_half, x)

# the last output needs to be run without checkpoint
x = layers[-1](x)
x.sum().backward()  # works!
```
For sequential modules (which can have arbitrary blocks inside), a helper function `checkpoint_sequential` is provided, which takes care of the most common use-cases:

```python
input = torch.rand(1, 10, requires_grad=True)
layers = [nn.Linear(10, 10) for _ in range(1000)]
model = nn.Sequential(*layers)

from torch.utils.checkpoint import checkpoint_sequential

# split in two blocks
num_segments = 2
x = checkpoint_sequential(model, num_segments, input)
x.sum().backward()  # works!
```
bottleneck - a tool to identify hotspots in your code

`torch.utils.bottleneck` (#5216, #6425) is a tool that can be used as an initial step for debugging bottlenecks in your program. It summarizes runs of your script with the Python profiler and PyTorch's autograd profiler. See the bottleneck docs for more details.

reduce=False Losses

As of this release, all of our loss functions support the `reduce` keyword. Specifying `reduce=False` gives a Tensor per unit of loss instead of a single reduced loss. #4924, #5346, #5646, #4231, #4705, #5680

New modules and module improvements
- Add `DistributedDataParallelCPU`. This is similar to `DistributedDataParallel`, but with specific support for models running on the CPU (contrary to `DistributedDataParallel`, which targets GPU), and supports `mpi`, `gloo` and `tcp` backends #5919.
- Add Group Normalization (`nn.GroupNorm`), an alternative to batch normalization that doesn't suffer from the same issues as `BatchNorm` for small batch sizes
- Add Layer Normalization (`nn.LayerNorm`), an alternative for batch normalization often used in NLP tasks. #4922
- Add Local Response Normalization (`nn.LocalResponseNorm`). #4922
- `MaxPool3d` now supports double backwards. MaxPool3d and MaxUnpool3d now use indices consistent with the rest of the pooling layers. #5328
- All loss functions now support a `reduce` argument to return a batch of losses. #264
- Add util to clip gradient value in `torch.nn.utils.clip_grad` and add param to He initialization scheme in `torch.nn.init`. #6173
- Renamed `torch.nn.init.*` methods to have an underscore at the end, as they operate in-place, and deprecated the old versions #6093
- Added support for returning dictionaries in `DataParallel` #6113
- Added support for N-D tensors in `torch.nn.Bilinear` #5764
- Add `Embedding.from_pretrained` factory. This allows initializing an Embedding layer with an existing tensor, bypassing the initial random initialization of its weights.
- You can now slice `nn.Sequential`, `nn.ModuleList`, and `nn.ParameterList` #4491
- Registered `nn.Module` integer parameters and buffers are now immune to `module.float()`, `module.double()` and `module.half()` calls. #3820
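A brief sketch of the new normalization layers listed above (the shapes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 6, 4, 4)      # (batch, channels, H, W)

# GroupNorm: 3 groups over 6 channels; statistics are computed
# per sample, so tiny batch sizes are fine (unlike BatchNorm)
gn = nn.GroupNorm(num_groups=3, num_channels=6)
print(gn(x).shape)               # same shape as the input

# LayerNorm over the last dimension, as commonly used in NLP
tokens = torch.randn(2, 5, 16)   # (batch, seq, features)
ln = nn.LayerNorm(16)
print(ln(tokens).shape)
```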
torch.distributions

`torch.distributions` has expanded to include 24 basic probability distributions: `Bernoulli`, `Beta`, `Binomial`, `Categorical`, `Cauchy`, `Chi2`, `Dirichlet`, `Exponential`, `FisherSnedecor`, `Gamma`, `Geometric`, `Gumbel`, `Laplace`, `LogNormal`, `Multinomial`, `MultivariateNormal`, `Normal`, `OneHotCategorical`, `Pareto`, `Poisson`, `RelaxedBernoulli`, `RelaxedOneHotCategorical`, `StudentT`, and `Uniform`.

The `Distribution` interface has expanded to include many methods including `.cdf()`, `.icdf()`, `.mean()`, `.variance()`, `.entropy()`, and `.perplexity()`. Distributions now split tensor dimensions into `sample_shape` + `batch_shape` + `event_shape`. Most continuous distributions now also implement a differentiable `.rsample()` method to compute pathwise derivatives, aka the reparameterization trick (check `.has_rsample` for availability):

```python
>>> loc = torch.tensor(0., requires_grad=True)
>>> scale = torch.tensor(1., requires_grad=True)
>>> samples = Normal(loc, scale).rsample(sample_shape=(1000,))
>>> loss = (samples - 0.5).pow(4).mean()  # average over 1000 monte carlo samples
>>> grad(loss, [loc, scale])
(tensor(-7.5092), tensor(15.2704))
```
Most discrete distributions implement an `.enumerate_support()` method to make it easy to sum over all possible sample values (check `.has_enumerate_support` for availability).

`kl_divergence` is defined for many pairs of distributions, e.g.

```python
>>> x = torch.tensor(1.0, requires_grad=True)
>>> kl = kl_divergence(Uniform(-x, x), Normal(0., 1.))
>>> grad(kl, [x])[0]
tensor(-0.6667)
```
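A quick sketch of `.enumerate_support()` on discrete distributions:

```python
import torch
from torch.distributions import Bernoulli, Categorical

b = Bernoulli(torch.tensor(0.3))
print(b.enumerate_support())     # tensor([0., 1.])

c = Categorical(torch.tensor([0.2, 0.5, 0.3]))
support = c.enumerate_support()  # the indices 0, 1, 2

# summing probabilities over the full support recovers (approximately) 1
total = c.log_prob(support).exp().sum()
print(total)
```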
Distribution Transforms

New distributions can be created by combining `TransformedDistribution` with any number of `Transform` objects from the `torch.distributions.transforms` library, including: `ExpTransform`, `PowerTransform`, `SigmoidTransform`, `AbsTransform`, `AffineTransform`, `SoftmaxTransform`, `StickBreakingTransform`, `LowerCholeskyTransform`, and their inverses via the `.inv` property.

Distribution Constraints

Distributions provide metadata about the constraints of their `.support` and about their arguments (`.arg_constraints`). These `Constraint` objects are registered with transforms using `transform_to()` and `biject_to()`. Together, constraints and transforms make it easy to specify new distributions in a generic way:

```python
>>> scale = torch.tensor(1., requires_grad=True)
>>> p = Normal(0., scale)
>>> assert p.arg_constraints['scale'] == constraints.positive
>>> prior = TransformedDistribution(Normal(0., 1.),
...                                 transform_to(constraints.positive))
```
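To illustrate the transform machinery described above, a log-normal distribution can be assembled by pushing a `Normal` through `ExpTransform`; a sketch that checks it against the built-in `LogNormal`:

```python
import torch
from torch.distributions import Normal, LogNormal, TransformedDistribution
from torch.distributions.transforms import ExpTransform

# exp(X) with X ~ Normal(0, 1) is log-normally distributed
base = Normal(torch.tensor(0.), torch.tensor(1.))
d = TransformedDistribution(base, [ExpTransform()])

x = torch.tensor([0.5, 1.0, 2.0])
ref = LogNormal(torch.tensor(0.), torch.tensor(1.))
print(d.log_prob(x))   # agrees with the built-in LogNormal
print(ref.log_prob(x))
```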
Constraints in the `torch.distributions.constraints` library include: `boolean`, `greater_than(lower_bound)`, `integer_interval(lower_bound, upper_bound)`, `interval(lower_bound, upper_bound)`, `lower_cholesky`, `lower_triangular`, `nonnegative_integer`, `positive`, `positive_definite`, `positive_integer`, `real`, `real_vector`, `simplex`, and `unit_interval`.

Distributed

Helper utility for launching Distributed Training jobs

We have added a utility function to help launch jobs on a distributed setup. In order to launch a script that leverages `DistributedDataParallel` on either a single node or multiple nodes, we can make use of `torch.distributed.launch` as follows:

```shell
python -m torch.distributed.launch my_script.py --arg1 --arg2 --arg3
```

The script simplifies day-to-day usability of the `distributed` package. You can read about its usage here: http://pytorch.org/docs/stable/distributed.html#launch-utility
A new distributed backend based on NCCL 2.0

PyTorch now has a new distributed backend, which leverages NCCL 2.0 for maximum speed. It also provides new APIs for collective operations on multiple GPUs. You can enable the new backend via:

```python
torch.distributed.init_process_group("nccl")
```

Other distributed improvements

- Coalesce many small broadcasts to improve performance #4978
- Add mixed-precision support for distributed training #4891
- Release NCCL distributed backend. Previously it was marked as `experimental`. #4921
- Enable Infiniband support for Gloo data channel with automatic IB device detection #4795
C++ extensions

Previously, the official way of writing extensions using C or CUDA for custom modules was through the cffi extension. The drawback of this method was that it required a separate step for compiling the CUDA kernels, which could be a bit messy.

PyTorch now provides a better system for writing your own C++ / CUDA extensions. Example implementations using this new extension support can be found in the pytorch/cpp_extensions repo.

We provide two compilation modes:

- ahead-of-time compilation: you write a `setup.py` script using the new `CppExtension` or `CUDAExtension`, which is an extension of the `setuptools.Extension` module;
- just-in-time compilation: you pass the list of C++ / CUDA files that you want to compile to `torch.utils.cpp_extension.load`, and it will compile on the fly and cache the libraries for you.

Here is an example illustrating how easy it is to implement an extension:

In C++

```cpp
// my_implementation.cpp
#include <torch/torch.h>
#include <unordered_set>

// can use templates as well, but let's keep it simple
using scalar_t = float;

at::Tensor unique_float(at::Tensor input_) {
  // only works for floats
  AT_ASSERT(input_.type().scalarType() == at::ScalarType::Float,
            "input must be a float tensor");
  // and CPU tensors
  AT_ASSERT(!input_.type().is_cuda(), "input must be a CPU tensor");

  // make the input contiguous, to simplify the implementation
  at::Tensor input = input_.contiguous();

  // get the pointer that holds the data
  scalar_t* input_data = input.data<scalar_t>();

  // let's use a function from the std library to implement
  // the unique function
  std::unordered_set<scalar_t> set(input_data, input_data + input.numel());

  // create the output tensor, with size set.size()
  at::Tensor output = input.type().tensor({static_cast<int64_t>(set.size())});
  scalar_t* output_data = output.data<scalar_t>();

  // copy the content of the set to the output tensor
  std::copy(set.begin(), set.end(), output_data);

  return output;
}

// this defines the functions exposed to Python
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("unique_float", &unique_float, "Unique for float tensors");
}
```

And then in Python:

```python
import torch
from torch.utils.cpp_extension import load as load_ext

# pass the source files; they will be compiled on the fly
# and will return a python module
_C = load_ext('my_unique_lib', sources=['my_implementation.cpp'])

# now can use the functions implemented in C++
unique = _C.unique_float
a = torch.tensor([1.0, 2.0, 1.0])
print(unique(a))
# tensor([2., 1.])
```
Windows support

PyTorch now officially supports Windows. We provide pre-compiled Conda binaries and pip wheels for Python 3.5 and 3.6. PyTorch on Windows doesn't support `distributed` training and might be a tad bit slower than Linux / OSX because Visual Studio supports an older version of OpenMP.

As always, you can use the commands at http://pytorch.org to install PyTorch on Windows. We have an FAQ that answers most questions you might have around Windows here: http://pytorch.org/docs/stable/notes/windows.html

ONNX Improvements
New ONNX operators

- Support export of `torch.max(input, dim)` and `torch.min(input, dim)` #6220
- Add symbolic for `ReLU` to support exporting to ONNX #5759
- Add `sum`, `prod`, `sqrt` and improve `log_softmax` #4579
- Add ONNX support for `InstanceNorm` #4626
- Add ONNX symbolic for `Elu` #3453
- Add ONNX symbolic for `UpsamplingNearest2d` #3450
Improvements

- Print source location when ONNX export fails for a node #5652
- Export onnx protobuf bindings to python #6651
- Support `output_padding` in `ConvTranspose` #4583
Better RNN support

PyTorch can now export a subset of RNNs to ONNX #4409

- Add Elman RNN export to ONNX #4613
- Support batch-first in ONNX export of padded sequences #5360
- Bidirectional Elman RNN export to ONNX #5120
- Handle sequence lengths correctly when exporting RNNs to ONNX #4695
- Support GRU export to ONNX #4390
Bugfixes

- Fix a bug in ONNX symbolic of 3d average pooling #6101
- Fix onnx export of replication/reflection pad #4263
Miscellaneous improvements

- Implement `__dir__` for Tensors, so that editors can automatically auto-complete and query for the possible fields in Tensors
- Add `numpy()` and `from_numpy()` to `HalfTensor`
- Enable `TensorDataset` to have any number of input tensors
- Add `padding_value` to `torch.nn.utils.rnn.pad_sequence`
- Add `total_length` option to `pack_padded_sequence`, which is useful when using `DataParallel`, as we can ensure that we have sequences of the same length
- Improve numerical precision of `torch.arange`, making it consistent with `numpy.arange`
- `torch.load()` and `torch.save()` support arbitrary file-like objects
- `torch.nn.functional.grid_sample` now supports 2D (spatial) and 3D (volumetric) inputs
- Set python random seed in `DataLoader` workers, in order to improve experiment reproducibility
- Add `__delitem__` to `nn.Sequential`. Now one can delete arbitrary elements of a `nn.Sequential`. For example:

  ```python
  model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 2))
  del model[1]  # deletes nn.ReLU
  ```

- `ReduceLROnPlateau` is now serializable #5300
- Add option to flush denormal numbers on CPU #5294
- PyTorch now exposes the gradients of conv1d, conv2d and conv3d with respect to the input and the weights #5408
- Add support for calling `pack_padded_sequence` with either a list or a Tensor #5133
- Support negative indexing for `padding_idx` in `nn.Embedding` #4496
- Implement backward pass for `pack_padded_sequence` #4512
- Add `nn.utils.rnn.pad_sequence` and `nn.utils.rnn.pack_sequence` to pad lists of variable length Tensors with `0` and to pack a list of variable length Tensors
- Add `torch.cuda.memory_cached`, `torch.cuda.max_memory_cached`, `torch.cuda.memory_allocated`, and `torch.cuda.max_memory_allocated` methods for checking CUDA memory usage #4511
- Allow viewing on noncontiguous tensors if the new view size is compatible with the tensor's original size and stride #4062
- `NLLLoss` and `CrossEntropyLoss` now support more than 2 dimensions #4654
- Add an option to not show `model_zoo` download progress bar #4135
- You can now assign modules to indices of `nn.Sequential` #4931
- You can create tensors with a numpy `np.longlong` array #4367
- Change the autograd execution order to use good heuristics. This greatly improves memory usage for large models #4746
- Add AMSgrad mode to `Adam` and `SparseAdam` optimizers #4034
- Better `torch.autograd.profiler` support for CUDA profiling using the `cudaEvent` API #3734
- `torch.set_num_threads` also sets the respective MKL option so you won't need to use an environment variable to control it #4949

Performance improvements
- Speed up CPU `nn.EmbeddingBag`, making training overall 30% faster #5433
- Move `nn.MarginRankingLoss`, `nn.CosineEmbeddingLoss`, `nn.HingeEmbeddingLoss`, and `nn.TripletMarginLoss` from Python to our ATen backend, resulting in up to 3x performance gains in some cases #5346, #5646, #5080, #5680
- Implement `pin_memory()` as a NativeFunction #4094
- Save `self.numel()` instead of `self` for the backward computation, to save memory #5747
- Rearrange dimensions for pointwise operations, for up to 10x better performance in one case #4174
- Vectorize `normal_`, for a 5-6x speedup in a small case #4312
- Allow use of GPU Direct within PyTorch for the broadcast operation #4183
- Speed up `nn.Linear` for the 3D-input case #5279
- Speed up `Conv3d` on the CPU by parallelizing `vol2col` and `col2vol` #4824
- Add an AVX2 implementation of the sigmoid function, showing around a 10x speedup #5010
- Use a fast integer division algorithm to avoid division ops inside kernels #5054
- Improve occupancy for CUDA random number generation #5710
- Add optimizations to `norm` for common norms #5722
- Add a fast fused GLU backward #5782
- Optimize unique sorting by using `std::vector` plus `sort` instead of `std::set`, giving up to a 5x speedup #5913
- Speed up sum over a dimension #6026
- Enable MKL-DNN convolution forward and backward #6062
- Parallelize non-contiguous pointwise operations with OpenMP #2764
- Add cuDNN Tensor Core ops to RNNs for Volta #3409
- Vectorize `exp`, `log`, `sin`, `cos` #6078
- Reuse intermediate results over multiple backward grad_inputs #3526
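As a quick illustration of the `torch.set_num_threads` change noted earlier, setting the thread count from Python now also configures MKL, so no environment variable is needed (a minimal sketch, assuming `torch` is importable):

```python
import torch

# One call now controls both the ATen and MKL thread pools.
torch.set_num_threads(2)
print(torch.get_num_threads())  # 2
```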
Distributed
- DistributedDataParallel: 10% NCCL backend performance improvement, with mixed-precision support #5064
- Slightly improve DistributedDataParallel (single-GPU binding) multi-process distributed training performance #4870
Bug fixes
torch operators
- Improve `torch.digamma` precision near poles #6517
- Fix incorrect behavior of `Tensor.random_` on negative inputs #6463
- Fix undefined behavior in the backward pass of `tensor.permute(dims)` with negative dims #5945
- Fix integer overflow in the `torch.remainder` operator (it would break with a divisor above `2**48`) #5906
- Fix memory leak in `torch.bmm` #5744
- Make the dimension checker of `scatter_add_` consistent with `scatter_`'s #5659
- Fix CPU `torch.multinomial` with noncontiguous probability tensor input (previously, it would overwrite input data) #5093
- Fix CUDA `torch.multinomial` using incorrect strides and being able to select zero-probability events #5774, #5238
- Support empty index tensor for `index_select` #3429
- Support empty indices tensor in CUDA `Tensor.put_` #4486
- Improve stability of `torch.cat` with empty tensors #3602, #5971, #5819
- Fix `torch.fft` in the case where any of the input dimensions is not aligned #6118
- Improve the CUDA `btrifact` error message #5644
- Return zeros for the eigenvector tensor when it is not requested in `torch.symeig` #3411
- Fix `torch.btrifact` on tensors #4318
- Fix `torch.pstrf` on tensors #4883
- Fix memory leak in `torch.median` #6889
- Fix SVD backward on non-square matrices when `some=False` #6870
core
- Detect re-initialization of the `_C` shared library, which would often result in segfaults on exit #6232
- Fix indexing with all-zero ByteTensors #3926
- Only allow dense floating-point types as the default tensor type #5674
- Initialize CUDA before setting CUDA tensor types as default, to prevent a crash #4788
- Fix a bug where `from_dlpack` failed if CUDA was not initialized #4182
- Fix crash when creating a CUDA tensor from a numpy array #5850
- Fix broken sharing of empty tensors in multiprocessing on some OSes #6229
autograd
- Restore `allow_unused` functionality: throw an error when a differentiated input is unused or unreachable #6553
- Fix `output_nr` not being incremented correctly. This caused crashes in the backward pass of operations that don't `requires_grad` on some inputs #4812
- Fix nvprof parsing in `torch.autograd.profiler` #5840
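The restored `allow_unused` behavior can be exercised like this (a minimal sketch, assuming `torch` is importable; variable names are illustrative):

```python
import torch

x = torch.ones(2, requires_grad=True)
y = torch.ones(2, requires_grad=True)
out = (x * 2).sum()  # y is never used in the graph

# Without allow_unused=True this raises a RuntimeError, because
# y is unreachable from out; with it, the unused grad is None.
gx, gy = torch.autograd.grad(out, [x, y], allow_unused=True)
print(gx)  # tensor([2., 2.])
print(gy)  # None
```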
nn layers
- Support specifying the size in only certain dimensions for adaptive pooling #3127
- Fix reflection padding boundary checks to not cause invalid memory access #6438
- Improve error messages for `NLLLoss` #5299, #6072
- Fix `kl_div` backward on CUDA. Previously it would not respect `gradOutput` when computing `gradInput` #5814
- Fix incorrect `bias` size assert for `Linear` #5992
- Fix incorrect `nn.functional.convNd` and `nn.functional.conv_transposeNd` error messages #5701
- Check that the shapes of input and target match, instead of the number of elements, for some loss functions #5085
- Fix `torch.diag` backward returning a square grad for non-square input #4538
- Fix convolution type-mismatch error message #5815
- Add an `align_corners` option to linearly interpolating upsampling, and make the default upsampling behavior more consistent with other frameworks #5927
- Prevent numerical issues with `poisson_nll_loss` when `log_input=False` #3336
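The new `align_corners` option mentioned above changes how border pixels are interpolated; a minimal sketch (assuming `torch` is importable):

```python
import torch
import torch.nn as nn

x = torch.arange(4.0).view(1, 1, 2, 2)  # N, C, H, W

up_true = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
up_false = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

print(up_true(x).shape)  # torch.Size([1, 1, 4, 4])
# The two settings produce different values near the borders.
```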
CUDA
- Ensure convolution weights are contiguous, fixing CUDA `ConvTranspose` double backward #4543
- Fix CUDA double backwards #4460
sparse

- Fix embedding with `sparse=True` #4686
- Fix sparse embedding backward when the input contains only `padding_idx` #6211
- Handle copying empty sparse tensors to/from CPU and GPU #5361
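The sparse embedding path touched by the fixes above produces a sparse weight gradient; a minimal sketch (assuming `torch` is importable):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 3, sparse=True)
out = emb(torch.tensor([1, 2, 4]))
out.sum().backward()

# With sparse=True, the weight gradient is a sparse tensor that only
# touches the rows actually looked up.
print(emb.weight.grad.is_sparse)  # True
```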
dataloader
- Add argument checks to the `torch.utils.data.Sampler` classes, fixing a bug where `DataLoader` would try to load the entire dataset on a non-integer `batch_size` #6249
- Set `dataloader.batch_size = None` when a `batch_sampler` is given, fixing a bug where `DataLoader` would report `batch_size` as `1` #6108
- Improve signal handling in `DataLoader` #4643
- Ignore `FileNotFoundError` when shutting down #5380
- Make preprocessing deterministic #4640
optim
- Cast tensors when loading optimizer state dicts, to improve usability #3658
- List model parameters in a deterministic order, to improve the stability of `load_state_dict()` #6031
- Add parameter range checks for all optimizers #6000
- Fix AMSGrad mode for `SparseAdam` #4314
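AMSGrad mode, added to `Adam` and `SparseAdam` in this release and fixed for `SparseAdam` above, is enabled with a single flag; a minimal sketch (assuming `torch` is importable):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)
# amsgrad=True selects the AMSGrad variant of Adam.
opt = torch.optim.Adam(model.parameters(), lr=0.1, amsgrad=True)

loss = model(torch.ones(1, 2)).sum()
loss.backward()
opt.step()
```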
distributed and multi-gpu