PyTorch – Internal Architecture Tour

Update 28 Feb 2019: I added a new blog post with a slide deck containing the presentation I did for PyData Montreal.


This post is a tour around the PyTorch codebase; it is meant to be a guide for the architectural design of PyTorch and its internals. My main goal is to provide something useful for those who are interested in understanding what happens beyond the user-facing API, and to show something new beyond what was already covered in other tutorials.

Note: the PyTorch build system uses code generation extensively, so I won't repeat here what was already described by others. If you're interested in understanding how this works, please read the following tutorials:

Short intro to Python extension objects in C/C++

As you probably know, you can extend Python using C and C++, and develop what is called an "extension". All the PyTorch heavy lifting is implemented in C/C++ instead of pure Python. To define a new Python object type in C/C++, you define a structure like the example below (which is the base for the autograd Variable class):

// Python object that backs torch.autograd.Variable
struct THPVariable {
    PyObject_HEAD
    torch::autograd::Variable cdata;
    PyObject* backward_hooks;
};


There are two extra macros in the Python API called Py_INCREF() and Py_DECREF(), which are used to increment and decrement the reference counter of Python objects. Multiple entities can borrow or own a reference to other objects (the reference counter is increased), and only when this reference counter reaches zero (when all references get destroyed) will Python automatically delete the memory for that object using its garbage collector.
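The same ownership rules are visible from the Python side through sys.getrefcount(). A minimal CPython-specific sketch (note that getrefcount() reports one extra reference for its own argument):

```python
import sys

x = []                       # a fresh object; the name "x" holds one reference
base = sys.getrefcount(x)    # baseline count (includes the call's temporary reference)

y = x                        # a second name takes another reference
assert sys.getrefcount(x) == base + 1

del y                        # dropping the reference decrements the counter again
assert sys.getrefcount(x) == base
```

When the last reference is dropped, CPython frees the object immediately; this is exactly the mechanism PyTorch hooks into with Py_INCREF()/Py_DECREF() below.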

You can read more about Python C/C++ extensions here.

Funny fact: it is very common in many applications to use small integer numbers for indexing, counters, etc. For efficiency, the official CPython interpreter caches the integers from -5 up to 256. For that reason, the statement a = 200; b = 200; a is b will be True, while the statement a = 300; b = 300; a is b will be False.
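You can verify this caching behavior yourself. Constructing the integers at runtime via int() avoids the constant folding CPython applies inside a single code object (which would otherwise make the 300 case misleading). This is a CPython implementation detail, not a language guarantee:

```python
# Small integers (-5 to 256) come from CPython's cache, so any two
# constructions of the same value yield the very same object.
a = int("200")
b = int("200")
print(a is b)    # True: both names point to the cached object

# 300 is outside the cache, so each construction allocates a new object.
c = int("300")
d = int("300")
print(c is d)    # False: two distinct objects with equal values
```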


PyTorch has its own Tensor representation, which decouples the PyTorch internal representation from external representations. However, it is very common, especially when data is loaded from a variety of sources, to have Numpy arrays everywhere, so we really need to make conversions between Numpy and PyTorch tensors. For that reason, PyTorch provides two methods called from_numpy() and numpy(), which convert a Numpy array to a PyTorch tensor and vice versa, respectively. If we look at the code that is called to convert a Numpy array into a PyTorch tensor, we can get more insights on PyTorch's internal representation:

at::Tensor tensor_from_numpy(PyObject* obj) {
  if (!PyArray_Check(obj)) {
    throw TypeError("expected np.ndarray (got %s)", Py_TYPE(obj)->tp_name);
  }

  auto array = (PyArrayObject*)obj;
  int ndim = PyArray_NDIM(array);
  auto sizes = to_aten_shape(ndim, PyArray_DIMS(array));
  auto strides = to_aten_shape(ndim, PyArray_STRIDES(array));
  // NumPy strides use bytes. Torch strides use element counts.
  auto element_size_in_bytes = PyArray_ITEMSIZE(array);
  for (auto& stride : strides) {
    stride /= element_size_in_bytes;
  }

  // (...) - omitted for brevity

  void* data_ptr = PyArray_DATA(array);
  auto& type = CPU(dtype_to_aten(PyArray_TYPE(array)));
  Py_INCREF(obj);
  return type.tensorFromBlob(data_ptr, sizes, strides, [obj](void* data) {
    AutoGIL gil;
    Py_DECREF(obj);
  });
}

(code from tensor_numpy.cpp)


Also, there is an important point here: when the Numpy array object goes out of scope and gets a zero reference count, it will be garbage collected and destroyed. That's why there is an increment of the reference counting of the Numpy array object at the Py_INCREF(obj) call.

After that, PyTorch will create a new Tensor object from this Numpy data blob. In the creation of this new Tensor, it passes the borrowed memory data pointer, together with the memory size and strides, as well as a function that will be used later by the Tensor Storage (we'll discuss this in the next section) to release the data by decrementing the reference counting of the Numpy array object and letting Python take care of this object's life cycle.
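The byte-to-element stride conversion done in tensor_from_numpy() is easy to reproduce. This is a minimal sketch of that loop (the function name is mine for illustration, not a PyTorch API):

```python
def byte_strides_to_element_strides(byte_strides, itemsize):
    """Mirror of the loop in tensor_from_numpy(): NumPy strides count
    bytes, while Torch strides count elements, so divide each stride
    by the item size."""
    return [s // itemsize for s in byte_strides]

# A C-contiguous 3x4 float32 array (4-byte items) has byte strides
# (16, 4); the equivalent Torch strides are (4, 1).
print(byte_strides_to_element_strides([16, 4], 4))   # [4, 1]
```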




Tensor Storage

As we saw in the previous code from tensor_from_numpy(), there is a call to tensorFromBlob() that will create a Tensor from the raw data blob. This last function will call another function called storageFromBlob() that will, in turn, create a storage for this data according to its type. In the case of a CPU float type, it will return a new CPUFloatStorage instance.

The CPUFloatStorage is basically a wrapper with utility functions around the actual storage structure called THFloatStorage, which we show below:

typedef struct THStorage
{
    real *data;
    ptrdiff_t size;
    int refcount;
    char flag;
    THAllocator *allocator;
    void *allocatorContext;
    struct THStorage *view;
} THStorage;

(code from THStorage.h)

As you can see, the THStorage holds a pointer to the raw data, its size, flags, and also an interesting field called allocator that we'll discuss soon. It is also important to note that there is no metadata about how to interpret the data inside THStorage; this is due to the fact that the storage is "dumb" regarding its contents, and it is the Tensor's responsibility to know how to "view" or interpret this data. Because of that, you can have multiple tensors pointing to the same storage but with different views of the same data:


>>> tensor_a = torch.ones((3, 3))
>>> tensor_b = tensor_a.view(9)
>>> tensor_a.storage().data_ptr() == tensor_b.storage().data_ptr()
True


Now, as we saw in the THStorage structure above, there is a pointer to a THAllocator structure. And this is very important because it brings flexibility regarding the allocator that can be used to allocate the storage data. This structure is represented by the following code:

typedef struct THAllocator
{
  void* (*malloc)(void*, ptrdiff_t);
  void* (*realloc)(void*, void*, ptrdiff_t);
  void (*free)(void*, void*);
} THAllocator;

(code from THAllocator.h)

As you can see, there are three function pointer fields in this structure that define what an allocator means: a malloc, a realloc and a free. For CPU-allocated memory, these functions will, of course, relate to the traditional malloc/realloc/free POSIX functions. However, when we want a storage allocated on the GPU, we'll end up using CUDA allocators such as cudaMallocHost(), as we can see in the THCudaHostAllocator malloc function below:

static void *THCudaHostAllocator_malloc(void* ctx, ptrdiff_t size) {
  void* ptr;

  if (size < 0) THError("Invalid memory size: %ld", size);
  if (size == 0) return NULL;

  THCudaCheck(cudaMallocHost(&ptr, size));

  return ptr;
}

(code from THCAllocator.c)
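The value of this record-of-function-pointers design is that the same storage code can be parametrized with any allocation strategy. A toy Python analogue (all names below are illustrative, not PyTorch API):

```python
class Allocator:
    """Toy analogue of THAllocator: a record of three callbacks."""
    def __init__(self, malloc, realloc, free):
        self.malloc = malloc
        self.realloc = realloc
        self.free = free

log = []  # record every allocator call so we can inspect the behavior

def tracked_malloc(ctx, size):
    log.append(("malloc", size))
    return bytearray(size)

def tracked_realloc(ctx, buf, size):
    log.append(("realloc", size))
    new = bytearray(size)
    n = min(len(buf), size)
    new[:n] = buf[:n]
    return new

def tracked_free(ctx, buf):
    log.append(("free", len(buf)))

cpu_allocator = Allocator(tracked_malloc, tracked_realloc, tracked_free)
buf = cpu_allocator.malloc(None, 16)   # a storage would call this on creation
cpu_allocator.free(None, buf)          # ... and this when it is destroyed
print(log)                             # [('malloc', 16), ('free', 16)]
```

A CUDA-backed allocator would plug in different callbacks (e.g. ones wrapping cudaMallocHost), leaving the storage logic untouched.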

You may have noticed a pattern in the repository organization; it is important to keep these conventions in mind when navigating the repository, as summarized here (taken from the PyTorch lib readme):

  • TH = TorcH
  • THC = TorcH Cuda
  • THCS = TorcH Cuda Sparse
  • THCUNN = TorcH Cuda Neural Network
  • THD = TorcH Distributed
  • THNN = TorcH Neural Network
  • THS = TorcH Sparse

This convention is also present in the function/class names and other objects, so it is important to always keep these patterns in mind. While you can find CPU allocators in the TH code, you’ll find CUDA allocators in the THC code.


typedef struct THTensor
{
    int64_t *size;
    int64_t *stride;
    int nDimension;
    THStorage *storage;
    ptrdiff_t storageOffset;
    int refcount;
    char flag;
} THTensor;

(code from THTensor.h)

And as you can see, the main THTensor structure holds the sizes/strides/dimensions/offsets/etc., as well as the storage (THStorage) for the Tensor data.
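The division of labor between THTensor and THStorage can be sketched in a few lines of Python. The classes below are illustrative toys that mimic the idea, not PyTorch internals:

```python
class Storage:
    """A 'dumb' flat buffer, like THStorage: no shape information at all."""
    def __init__(self, data):
        self.data = list(data)

class Tensor:
    """Metadata (sizes/strides/offset) interpreting a Storage, like THTensor."""
    def __init__(self, storage, sizes, strides, offset=0):
        self.storage = storage
        self.sizes = sizes
        self.strides = strides
        self.offset = offset

    def item(self, *index):
        # flat position = offset + sum(index_i * stride_i)
        pos = self.offset + sum(i * s for i, s in zip(index, self.strides))
        return self.storage.data[pos]

storage = Storage(range(9))                        # flat values 0..8
t = Tensor(storage, sizes=(3, 3), strides=(3, 1))  # 3x3 row-major view
flat = Tensor(storage, sizes=(9,), strides=(1,))   # flat view, same storage
print(t.item(1, 2), flat.item(5))                  # 5 5 -- the same element
```

Two tensors over one storage is exactly what the earlier view(9) example showed: only the metadata differs.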


Now, once we have requirements such as multi-processing, where we want to share tensor data among multiple different processes, we need a shared memory approach to solve it. Otherwise, every time another process needs a tensor, or even when you want to implement Hogwild training, where all the different processes write to the same memory region (where the parameters are), you'd need to make copies between processes, and this is very inefficient. Therefore, we'll discuss in the next section a special kind of storage for Shared Memory.



Shared Memory

PyTorch provides a wrapper around the Python multiprocessing module, which can be imported from torch.multiprocessing. The changes they implemented in this wrapper around the official Python multiprocessing were done to make sure that every time a tensor is put on a queue or shared with another process, PyTorch will make sure that only a handle for the shared memory will be shared, instead of a new entire copy of the Tensor.

Now, many people aren't aware of a Tensor method from PyTorch called share_memory_(). However, this function is what triggers an entire rebuild of the storage memory for that particular Tensor. What this method does is create a region of shared memory that can be used among different processes. This function will, in the end, call the following function:

static THStorage* THPStorage_(newFilenameStorage)(ptrdiff_t size)
{
  int flags = TH_ALLOCATOR_MAPPED_SHAREDMEM | TH_ALLOCATOR_MAPPED_EXCLUSIVE;
  std::string handle = THPStorage_(__newHandle)();
  auto ctx = libshm_context_new(NULL, handle.c_str(), flags);
  return THStorage_(newWithAllocator)(size, &THManagedSharedAllocator, (void*)ctx);
}

(code from StorageSharing.cpp)

As you can see, this function will create another storage using a special allocator called THManagedSharedAllocator. The function first defines some flags, then it creates a handle, which is a string in the format /torch_[process id]_[random number], and after that, it creates a new storage using the special THManagedSharedAllocator. This allocator has function pointers to an internal PyTorch library called libshm, that will implement a Unix Domain Socket communication to share the shared memory region handles. This allocator is actually a special case: it is a kind of "smart allocator" because it contains the communication control logic, and it uses another allocator called THRefcountedMapAllocator that will be responsible for creating the actual shared memory region and calling mmap() to map this region to the process virtual address space.

Note: when a method ends with an underscore in PyTorch, such as the method called share_memory_(), it means that this method has an in-place effect: it will change the current object instead of creating a new one with the modifications.
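The handle-passing idea (share a small name instead of copying the buffer) can be sketched with the standard library's multiprocessing.shared_memory module, which is independent of PyTorch's libshm but follows the same principle:

```python
from multiprocessing import shared_memory

# "Process A": create a named shared memory region and write into it.
shm = shared_memory.SharedMemory(create=True, size=16)
shm.buf[:5] = b"hello"
handle = shm.name          # this small string is all another process needs

# "Process B" (simulated here in-process): attach by handle, zero copy.
other = shared_memory.SharedMemory(name=handle)
contents = bytes(other.buf[:5])
print(contents)            # b'hello'

other.close()
shm.close()
shm.unlink()               # release the region when done
```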

I'll now show a Python example of one process using the data from a Tensor that was allocated on another process by manually exchanging the shared memory handle:

This is executed in the process A:

>>> import torch
>>> tensor_a = torch.ones((5, 5))
>>> tensor_a

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

>>> tensor_a.is_shared()
False
>>> tensor_a = tensor_a.share_memory_()
>>> tensor_a.is_shared()
True
>>> tensor_a_storage = tensor_a.storage()
>>> tensor_a_storage._share_filename_()
(b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)

In this code, executed in process A, we create a new Tensor of size 5×5 filled with ones. After that, we make it shared and print the tuple with the Unix Domain Socket address as well as the handle. Now we can access this memory region from another process B, as shown below:


>>> import torch
>>> tensor_a = torch.Tensor()
>>> tuple_info = (b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)
>>> storage = torch.Storage._new_shared_filename(*tuple_info)
>>> tensor_a = torch.Tensor(storage).view((5, 5))

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

As you can see, using the tuple information about the Unix Domain Socket address and the handle, we were able to access the Tensor storage from another process. If you change the tensor in process B, you'll also see it reflected in process A, because these Tensors share the same memory region.

DLPack: a hope for the Deep Learning frameworks Babel

Now I would like to talk about something recent in the PyTorch code base, called DLPack. DLPack is an open standardization of an in-memory tensor structure that will allow exchanging tensor data between frameworks. What is quite interesting is that, since this memory representation is standardized and very similar to the in-memory representation already in use by many frameworks, it will allow zero-copy data sharing between frameworks, which is a quite amazing initiative given the variety of frameworks we have today without inter-communication among them.

This will certainly help to overcome the “island model” that we have today between tensor representations in MXNet, PyTorch, etc, and will allow developers to mix framework operations between frameworks and all the benefits that a standardization can bring to the frameworks.

The core of DLPack is a very simple structure called DLTensor, as shown below:

/*!
 * \brief Plain C Tensor object, does not manage memory.
 */
typedef struct {
  /*!
   * \brief The opaque data pointer points to the allocated data.
   *  This will be CUDA device pointer or cl_mem handle in OpenCL.
   *  This pointer is always aligned to 256 bytes as in CUDA.
   */
  void* data;
  /*! \brief The device context of the tensor */
  DLContext ctx;
  /*! \brief Number of dimensions */
  int ndim;
  /*! \brief The data type of the pointer*/
  DLDataType dtype;
  /*! \brief The shape of the tensor */
  int64_t* shape;
  /*!
   * \brief strides of the tensor,
   *  can be NULL, indicating tensor is compact.
   */
  int64_t* strides;
  /*! \brief The offset in bytes to the beginning pointer to data */
  uint64_t byte_offset;
} DLTensor;

(code from dlpack.h)

As you can see, there is a data pointer for the raw data, as well as shape/stride/offset information, GPU vs CPU context, and other metadata about the data that the DLTensor is pointing to.
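Since DLTensor is plain C, its layout can be mirrored from Python with ctypes. The DLContext and DLDataType definitions below are simplified stand-ins for the real dlpack.h types, just to make the field layout concrete:

```python
import ctypes

class DLContext(ctypes.Structure):
    # simplified: the real struct holds a DLDeviceType enum plus a device id
    _fields_ = [("device_type", ctypes.c_int), ("device_id", ctypes.c_int)]

class DLDataType(ctypes.Structure):
    _fields_ = [("code", ctypes.c_uint8), ("bits", ctypes.c_uint8),
                ("lanes", ctypes.c_uint16)]

class DLTensor(ctypes.Structure):
    # field order mirrors the dlpack.h listing above
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("ctx", DLContext),
        ("ndim", ctypes.c_int),
        ("dtype", DLDataType),
        ("shape", ctypes.POINTER(ctypes.c_int64)),
        ("strides", ctypes.POINTER(ctypes.c_int64)),
        ("byte_offset", ctypes.c_uint64),
    ]

shape = (ctypes.c_int64 * 2)(5, 5)   # must stay alive while the tensor uses it
t = DLTensor()
t.ndim = 2
t.shape = ctypes.cast(shape, ctypes.POINTER(ctypes.c_int64))
print(t.ndim, t.shape[0], t.shape[1])   # 2 5 5
```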

There is also a managed version of the tensor called DLManagedTensor, where the frameworks can provide a context and also a "deleter" function that can be called by the framework that borrowed the Tensor, to inform the other framework that the resources are no longer required.

In PyTorch, if you want to convert to or from the DLTensor format, you can find both C/C++ methods for doing that, and even in Python you can do it as shown below:

import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
dl = dlpack.to_dlpack(t)

This Python function will call thetoDLPackfunction from ATen, shown below:

DLManagedTensor* toDLPack(const Tensor& src) {
  ATenDLMTensor* atDLMTensor(new ATenDLMTensor);
  atDLMTensor->handle = src;
  atDLMTensor->tensor.manager_ctx = atDLMTensor;
  atDLMTensor->tensor.deleter = &deleter;
  atDLMTensor->tensor.dl_tensor.data = src.data_ptr();
  int64_t device_id = 0;
  if (src.type().is_cuda()) {
    device_id = src.get_device();
  }
  atDLMTensor->tensor.dl_tensor.ctx = getDLContext(src.type(), device_id);
  atDLMTensor->tensor.dl_tensor.ndim = src.dim();
  atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src.type());
  atDLMTensor->tensor.dl_tensor.shape = const_cast<int64_t*>(src.sizes().data());
  atDLMTensor->tensor.dl_tensor.strides = const_cast<int64_t*>(src.strides().data());
  atDLMTensor->tensor.dl_tensor.byte_offset = 0;
  return &(atDLMTensor->tensor);
}

As you can see, it’s a pretty simple conversion, casting the metadata from the PyTorch format to the DLPack format and assigning a pointer to the internal Tensor data representation.


That's it, I hope you liked this long post!

- Christian S. Perone

Cite this article as: Christian S. Perone, "PyTorch – Internal Architecture Tour," in Terra Incognita, December 3, 2018, //

13 thoughts on "PyTorch – Internal Architecture Tour"

  1. Great post! Very interesting to see the details of Pytorch as well as to know that it is well-implemented.

2. Nice post! However, I think you had better add the source code version, because the underlying backend is changing rapidly and some links are already broken.

  3. hi Christian, thanks for the insider details on pytorch.



    torch_array = torch.from_numpy(numpy_array) # less than 1 msec
    do processing on torch_array # less than 1 msec, GPU applied @ 99%
    numpy_array = np.array(torch_array) # greater than 200 msec

    GPU = nvidia on jetson TX1 platform
    torch = 0.4.0

    regards, h

    1. You should use .numpy().

      torch_array = torch.from_numpy(numpy_array)
      numpy_array = torch_array.numpy()

  4. Thanks for a great post! It really helped me understand how tensor storage works. Now I can check whether two tensors share the same storage (by `.data_ptr() == .data_ptr()`), but how can I check whether a numpy array is a view of a tensor? Is there a way to do a similar check between PyTorch and numpy? Thanks for your advice in advance!

    1. You can use n_array.__array_interface__["data"]; however, this is just for illustrative purposes, since comparing raw pointers is not a very good idea.

  5. Great post, it really helped me a lot in understanding Pytorch storage.

    My understanding is that I can create a pytorch tensor from an STL vector in C++ and expose it to Python through pybind without a copy.

    I wonder if I can expose an STL vector from C++ to Python and create a tensor from it without making a copy, even though the docs say torch.tensor always copies the data.
