PyTorch – Internal Architecture Tour

Update 28 Feb 2019: I added a new blog post with a slide deck containing the presentation I did for PyData Montreal.

Introduction

This post is a tour around the PyTorch codebase; it is meant to be a guide for the architectural design of PyTorch and its internals. My main goal is to provide something useful for those who are interested in understanding what happens beyond the user-facing API, and to show something new beyond what was already covered in other tutorials.

Note: the PyTorch build system uses code generation extensively, so I won’t repeat here what was already described by others. If you’re interested in understanding how this works, please read the following tutorials:

Short intro to Python extension objects in C/C++

As you may know, you can extend Python using C and C++ and develop what is called an “extension”. All the PyTorch heavy lifting is implemented in C/C++ instead of pure Python. To define a new Python object type in C/C++, you define a structure like the example below (which is the base for the autograd Variable class):

// Python object that backs torch.autograd.Variable
struct THPVariable {
    PyObject_HEAD
    torch::autograd::Variable cdata;
    PyObject* backward_hooks;
};

As you can see, there is a macro at the beginning of the definition, called PyObject_HEAD. The goal of this macro is the standardization of Python objects, and it will expand to another structure that contains a pointer to a type object (which defines initialization methods, allocators, etc.) and also a field with a reference counter.

There are two extra macros in the Python API called Py_INCREF() and Py_DECREF(), which are used to increment and decrement the reference counter of Python objects. Multiple entities can borrow or own references to other objects (the reference counter is increased), and only when this reference counter reaches zero (when all references get destroyed) will Python automatically free the memory of that object.
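You can observe this reference counting from regular Python; a small sketch using sys.getrefcount() (note that the call itself temporarily adds one extra reference to the count):

>>> import sys
>>> x = []
>>> sys.getrefcount(x)  # one reference from "x", plus the temporary argument reference
2
>>> y = x               # another name now references the same list
>>> sys.getrefcount(x)
3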

You can read more about Python C/C++ extensions here.

Fun fact: it is very common in many applications to use small integer numbers for indexing, counters, etc. For efficiency, the official CPython interpreter caches the integers from -5 up to 256. For this reason, the statement a = 200; b = 200; a is b will be True, while the statement a = 300; b = 300; a is b will be False.
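You can reproduce this in an interactive CPython session (this is an implementation detail, so the exact behaviour can vary between interpreter versions and between interactive and compiled code):

>>> a = 200
>>> b = 200
>>> a is b      # both names point to the same cached integer object
True
>>> a = 300
>>> b = 300
>>> a is b      # 300 is outside the cache, so these are two distinct objects
False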

Zero-copy PyTorch Tensor to Numpy and vice-versa

PyTorch has its own Tensor representation, which decouples PyTorch’s internal representation from external representations. However, since it is very common, especially when data is loaded from a variety of sources, to have Numpy arrays everywhere, we really need to make conversions between Numpy and PyTorch tensors. For that reason, PyTorch provides two methods called from_numpy() and numpy(), which convert a Numpy array to a PyTorch tensor and vice-versa, respectively. If we look at the code that is called to convert a Numpy array into a PyTorch tensor, we can get more insight into PyTorch’s internal representation:

at::Tensor tensor_from_numpy(PyObject* obj) {
  if (!PyArray_Check(obj)) {
    throw TypeError("expected np.ndarray (got %s)", Py_TYPE(obj)->tp_name);
  }

  auto array = (PyArrayObject*)obj;
  int ndim = PyArray_NDIM(array);
  auto sizes = to_aten_shape(ndim, PyArray_DIMS(array));
  auto strides = to_aten_shape(ndim, PyArray_STRIDES(array));
  // NumPy strides use bytes. Torch strides use element counts.
  auto element_size_in_bytes = PyArray_ITEMSIZE(array);
  for (auto& stride : strides) {
    stride /= element_size_in_bytes;
  }

  // (...) - omitted for brevity

  void* data_ptr = PyArray_DATA(array);
  auto& type = CPU(dtype_to_aten(PyArray_TYPE(array)));
  Py_INCREF(obj);
  return type.tensorFromBlob(data_ptr, sizes, strides, [obj](void* data) {
    AutoGIL gil;
    Py_DECREF(obj);
  });
}

(code from tensor_numpy.cpp)

As you can see from this code, PyTorch is obtaining all the information (array metadata) from the Numpy representation and then creating its own tensor. However, as you can note from the line calling PyArray_DATA(array), PyTorch is getting a pointer to the raw data of the internal Numpy array instead of copying it. This means that PyTorch will create a reference to this data, sharing the same memory region with the Numpy array object for the raw data of the tensor.
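A small sketch of this zero-copy behaviour in practice (the exact tensor repr differs between PyTorch versions, so only the scalar value is shown):

>>> import numpy as np
>>> import torch
>>> np_array = np.ones((2, 2))
>>> torch_tensor = torch.from_numpy(np_array)
>>> np_array[0, 0] = 200.0         # mutate the Numpy array in place
>>> float(torch_tensor[0, 0])      # the tensor sees the change: same memory
200.0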

There is also an important point here: when the Numpy array object goes out of scope and gets a zero reference count, it will be garbage collected and destroyed; that’s why there is an increment of the reference count of the Numpy array object in the Py_INCREF(obj) call.

After this, PyTorch will create a new Tensor object from this Numpy data blob. In the creation of this new Tensor, it passes the borrowed memory data pointer, together with the memory sizes and strides, as well as a function that will be used later by the Tensor Storage (we’ll discuss this in the next section) to release the data by decrementing the reference count of the Numpy array object and letting Python take care of this object’s life cycle.

The tensorFromBlob() method will create a new Tensor, but only after creating a new “Storage” for this Tensor. The Storage is where the actual data pointer will be stored (and not in the Tensor structure itself). This takes us to the next section, about Tensor Storages.

Tensor Storage

The actual raw data of the Tensor is not kept directly in the Tensor structure, but in another structure called Storage, which is in turn part of the Tensor structure.

As we saw in the previous code from tensor_from_numpy(), there is a call to tensorFromBlob() that will create a Tensor from the raw data blob. This last function will call another function called storageFromBlob() that will, in turn, create a storage for this data according to its type. In the case of a CPU float type, it will return a new CPUFloatStorage instance.

The CPUFloatStorage is basically a wrapper with utility functions around the actual storage structure called THFloatStorage, which we show below:

typedef struct THStorage
{
    real *data;
    ptrdiff_t size;
    int refcount;
    char flag;
    THAllocator *allocator;
    void *allocatorContext;
    struct THStorage *view;
} THStorage;

(code from THStorage.h)

As you can see, the THStorage holds a pointer to the raw data, its size, flags, and also an interesting field called allocator that we’ll discuss soon. It is also important to note that there is no metadata inside THStorage about how to interpret the data; this is due to the fact that the storage is “dumb” regarding its contents, and it is the Tensor’s responsibility to know how to “view” or interpret this data.

From this, you have probably already realized that we can have multiple tensors pointing to the same storage but with different views of this data, and that’s why viewing a tensor with a different shape (while keeping the same number of elements) is so efficient. The Python code below shows that the data pointer in the storage is shared after changing the way a Tensor views its data:

>>> tensor_a = torch.ones((3, 3))
>>> tensor_b = tensor_a.view(9)
>>> tensor_a.storage().data_ptr() == tensor_b.storage().data_ptr()
True

As we can see in the example above, the data pointer of the storage of both tensors is the same, but the tensors represent different interpretations of the storage data.
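And since both tensors are just different views of the same storage, an in-place write through one view is immediately visible through the other (continuing the session above):

>>> tensor_b[0] = 42.0          # write through the flat view
>>> float(tensor_a[0, 0])       # visible through the 3x3 view as well
42.0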

Now, as we saw in the THFloatStorage structure, there is a pointer to a THAllocator structure inside of it. And this is very important, because it brings flexibility regarding the allocator that can be used to allocate the storage data. This structure is represented by the following code:

typedef struct THAllocator
{
  void* (*malloc)(void*, ptrdiff_t);
  void* (*realloc)(void*, void*, ptrdiff_t);
  void (*free)(void*, void*);
} THAllocator;

(code from THAllocator.h)

As you can see, there are three function pointer fields in this structure that define what an allocator means: a malloc, a realloc and a free. For CPU-allocated memory, these functions will, of course, relate to the traditional malloc/realloc/free POSIX functions. However, when we want a storage allocated on the GPU, we’ll end up using CUDA allocators such as cudaMallocHost(), as we can see in the THCudaHostAllocator malloc function below:

static void *THCudaHostAllocator_malloc(void* ctx, ptrdiff_t size) {
  void* ptr;

  if (size < 0) THError("Invalid memory size: %ld", size);

  if (size == 0) return NULL;

  THCudaCheck(cudaMallocHost(&ptr, size));

  return ptr;
}

(code from THCAllocator.c)
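From the Python side, one place where this CUDA host allocator surfaces is pinned (page-locked) host memory, used for faster host-to-GPU transfers; a minimal sketch, assuming a CUDA-enabled PyTorch build:

>>> import torch
>>> t = torch.ones(1000)
>>> t.is_pinned()
False
>>> t_pinned = t.pin_memory()   # copies the data into page-locked host memory
>>> t_pinned.is_pinned()
True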

You probably noticed a pattern in the repository organization; it is important to keep these conventions in mind when navigating the repository, so they are summarized here (taken from the PyTorch lib README):

  • TH = TorcH
  • THC = TorcH Cuda
  • THCS = TorcH Cuda Sparse
  • THCUNN = TorcH CUda Neural Network
  • THD = TorcH Distributed
  • THNN = TorcH Neural Network
  • THS = TorcH Sparse

This convention is also present in the function/class names and other objects, so it is important to always keep these patterns in mind. While you can find CPU allocators in the TH code, you’ll find CUDA allocators in the THC code.

Finally, we can look at the composition of the main Tensor structure, the THTensor structure:

typedef struct THTensor
{
    int64_t *size;
    int64_t *stride;
    int nDimension;
    THStorage *storage;
    ptrdiff_t storageOffset;
    int refcount;
    char flag;
} THTensor;

(code from THTensor.h)

And as you can see, the main THTensor structure holds the sizes/strides/dimensions/offsets/etc., as well as the storage (THStorage) for the Tensor data.
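These fields are also visible from Python; a small sketch showing the strides and the storage offset of a view:

>>> import torch
>>> t = torch.ones((3, 4))
>>> t.stride()              # elements (not bytes) to skip per dimension
(4, 1)
>>> row = t[1]              # a view into the same storage
>>> row.storage_offset()    # the view starts 4 elements into the storage
4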

We can summarize all this structure in the diagram below, where the THTensor holds the view metadata and points to the THStorage, which in turn holds the pointer to the raw data: [diagram omitted]

Now, once we have requirements such as multi-processing, where we want to share tensor data among multiple different processes, we need a shared memory approach to solve it. Otherwise, every time another process needs a tensor, or even when you want to implement the Hogwild training procedure, where all the different processes write to the same memory region (where the parameters are), you would need to make copies between processes, and this is very inefficient. Therefore, we’ll discuss in the next section a special kind of storage for Shared Memory.

Shared Memory

Shared memory can be implemented in many different ways depending on the platform support. PyTorch supports some of them, but for the sake of simplicity, I’ll talk here about what happens on macOS using the CPU (instead of the GPU). Since PyTorch supports multiple shared memory approaches, this part is a little tricky to grasp, as it involves more levels of indirection in the code.

PyTorch provides a wrapper around the Python multiprocessing module, which can be imported from torch.multiprocessing. The changes implemented in this wrapper around the official Python multiprocessing were done to make sure that every time a tensor is put on a queue or shared with another process, PyTorch will make sure that only a handle to the shared memory will be shared, instead of an entirely new copy of the Tensor.
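A minimal sketch of that behaviour (the child’s in-place update is visible to the parent because only a shared-memory handle travels through the queue; details such as the start method vary by platform):

import torch
import torch.multiprocessing as mp

def consumer(queue):
    t = queue.get()   # receives a handle to shared memory, not a copy
    t += 1.0          # in-place update, visible to the producer process

if __name__ == "__main__":
    queue = mp.Queue()
    tensor = torch.zeros(3)
    p = mp.Process(target=consumer, args=(queue,))
    p.start()
    queue.put(tensor)  # the storage is moved to shared memory behind the scenes
    p.join()
    print(tensor)      # now all ones: reflects the consumer's update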

Now, many people aren’t aware of a Tensor method from PyTorch called share_memory_(); however, this function is what triggers an entire rebuild of the storage memory for that particular Tensor. What this method does is create a region of shared memory that can be used among different processes. This function will, in the end, call the following function:

static THStorage* THPStorage_(newFilenameStorage)(ptrdiff_t size)
{
  int flags = TH_ALLOCATOR_MAPPED_SHAREDMEM | TH_ALLOCATOR_MAPPED_EXCLUSIVE;
  std::string handle = THPStorage_(__newHandle)();
  auto ctx = libshm_context_new(NULL, handle.c_str(), flags);
  return THStorage_(newWithAllocator)(size, &THManagedSharedAllocator, (void*)ctx);
}

(code from StorageSharing.cpp)

As you can see, this function will create another storage using a special allocator called THManagedSharedAllocator. This function first defines some flags, then it creates a handle, which is a string in the format /torch_[process id]_[random number], and after that, it creates a new storage using the special THManagedSharedAllocator. This allocator has function pointers to an internal PyTorch library called libshm that will implement a Unix Domain Socket communication to share the shared memory region handles. This allocator is actually a special case; it is a kind of “smart allocator” because it contains the communication control logic, and it uses another allocator called THRefcountedMapAllocator that will be responsible for creating the actual shared memory region and calling mmap() to map this region to the process virtual address space.

Note: when a method in PyTorch ends with an underscore, such as the method called share_memory_(), it means that the method has an in-place effect: it will change the current object instead of creating a new one with the modifications.
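A quick illustration of this naming convention:

>>> import torch
>>> t = torch.ones(3)
>>> t2 = t.add(1.0)    # out-of-place: returns a new tensor, t is unchanged
>>> t.add_(1.0)        # in-place: mutates t itself and returns it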

I’ll now show a Python example of one process using the data from a Tensor that was allocated in another process, by manually exchanging the shared memory handle:

This is executed in process A:

>>> import torch
>>> tensor_a = torch.ones((5, 5))
>>> tensor_a

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

>>> tensor_a.is_shared()
False
>>> tensor_a = tensor_a.share_memory_()
>>> tensor_a.is_shared()
True
>>> tensor_a_storage = tensor_a.storage()
>>> tensor_a_storage._share_filename_()
(b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)

In this code, executed in process A, we create a new Tensor of size 5×5 filled with ones. After that, we make it shared and print the tuple with the Unix Domain Socket address as well as the handle. Now we can access this memory region from another process B, as shown below:

Code executed in process B:

>>> import torch
>>> tensor_a = torch.Tensor()
>>> tuple_info = (b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)
>>> storage = torch.Storage._new_shared_filename(*tuple_info)
>>> tensor_a = torch.Tensor(storage).view((5, 5))

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

As you can see, using the tuple information with the Unix Domain Socket address and the handle, we were able to access the Tensor storage from another process. If you change this tensor in process B, you’ll also see that it reflects in process A, because these Tensors share the same memory region.
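To make this concrete, a sketch continuing the two sessions above, where an in-place write in process B becomes visible in process A:

>>> # in process B
>>> tensor_a[0, 0] = 42.0

>>> # meanwhile, back in process A
>>> float(tensor_a[0, 0])
42.0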

DLPack: a hope for the Deep Learning frameworks Babel

Now I want to talk about something recent in the PyTorch code base, called DLPack. DLPack is an open standardization of an in-memory tensor structure that will allow exchanging tensor data between frameworks. What is quite interesting is that since this memory representation is standardized and very similar to the memory representation already in use by many frameworks, it will allow zero-copy data sharing between frameworks, which is quite an amazing initiative given the variety of frameworks we have today without inter-communication among them.

This will certainly help to overcome the “island model” that we have today between tensor representations in MXNet, PyTorch, etc, and will allow developers to mix framework operations between frameworks and all the benefits that a standardization can bring to the frameworks.

The core of DLPack is a very simple structure called DLTensor, as shown below:

/*!
 * \brief Plain C Tensor object, does not manage memory.
 */
typedef struct {
  /*!
   * \brief The opaque data pointer points to the allocated data.
   *  This will be CUDA device pointer or cl_mem handle in OpenCL.
   *  This pointer is always aligned to 256 bytes as in CUDA.
   */
  void* data;
  /*! \brief The device context of the tensor */
  DLContext ctx;
  /*! \brief Number of dimensions */
  int ndim;
  /*! \brief The data type of the pointer*/
  DLDataType dtype;
  /*! \brief The shape of the tensor */
  int64_t* shape;
  /*!
   * \brief strides of the tensor,
   *  can be NULL, indicating tensor is compact.
   */
  int64_t* strides;
  /*! \brief The offset in bytes to the beginning pointer to data */
  uint64_t byte_offset;
} DLTensor;

(code from dlpack.h)

As you can see, there is a data pointer for the raw data, as well as shape/stride/offset information, GPU vs CPU context, and other metadata about the data that the DLTensor is pointing to.

There is also a managed version of the tensor, called DLManagedTensor, where the frameworks can provide a context and also a “deleter” function that can be called by the framework that borrowed the Tensor, to inform the other framework that the resources are no longer required.

In PyTorch, if you want to convert to or from the DLTensor format, you can find both C/C++ methods for doing that, and even in Python you can do it, as shown below:

import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
dl = dlpack.to_dlpack(t)

This Python function will call the toDLPack function from ATen, shown below:

DLManagedTensor* toDLPack(const Tensor& src) {
  ATenDLMTensor* atDLMTensor(new ATenDLMTensor);
  atDLMTensor->handle = src;
  atDLMTensor->tensor.manager_ctx = atDLMTensor;
  atDLMTensor->tensor.deleter = &deleter;
  atDLMTensor->tensor.dl_tensor.data = src.data_ptr();
  int64_t device_id = 0;
  if (src.type().is_cuda()) {
    device_id = src.get_device();
  }
  atDLMTensor->tensor.dl_tensor.ctx = getDLContext(src.type(), device_id);
  atDLMTensor->tensor.dl_tensor.ndim = src.dim();
  atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src.type());
  atDLMTensor->tensor.dl_tensor.shape = const_cast<int64_t*>(src.sizes().data());
  atDLMTensor->tensor.dl_tensor.strides = const_cast<int64_t*>(src.strides().data());
  atDLMTensor->tensor.dl_tensor.byte_offset = 0;
  return &(atDLMTensor->tensor);
}

As you can see, it’s a pretty simple conversion, casting the metadata from the PyTorch format to the DLPack format and assigning a pointer to the internal Tensor data representation.
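The opposite direction also exists; a short sketch of a zero-copy round trip through DLPack within PyTorch itself:

import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
capsule = dlpack.to_dlpack(t)      # export the tensor as a DLPack capsule
t2 = dlpack.from_dlpack(capsule)   # import it back: no data is copied
t2[0, 0] = 7.0
print(float(t[0, 0]))              # 7.0 -- both tensors share the same memory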

I really hope that more frameworks adopt this standard, which will surely bring benefits to the ecosystem. It is also interesting that a potential integration with Apache Arrow would be amazing.

That’s it, I hope you liked this long post!

– Christian S. Perone

Cite this article as: Christian S. Perone, "PyTorch – Internal Architecture Tour," in Terra Incognita, 12/03/2018, //www.cpetem.com/2018/03/pytorch-internal-architecture-tour/

13 thoughts on “PyTorch – Internal Architecture Tour”

  1. Great post! Very interesting to see the details of Pytorch as well as to know that it is well-implemented.

  2. Good post! However, I think you had better add the source code version, because the underlying backend is changing rapidly and some of the links are already broken.

  3. Hi Christian, thanks for the insider details on PyTorch.

    I have a question about the conversion from PyTorch to Numpy, and I hope you can help me understand what is happening and how to fix it.

    Briefly: I convert an array to PyTorch, perform a process, then convert back to Numpy for subsequent processing with OpenCV.

    Example:
    torch_array = torch.from_numpy(numpy_array) # less than 1 msec
    do processing on torch_array # less than 1 msec, GPU @ 99%
    numpy_array = np.array(torch_array) # greater than 200 msec

    GPU = nvidia on jetson TX1 platform
    torch = 0.4.0

    Regards, h

    1. You should use .numpy().

      torch_array = torch.from_numpy(numpy_array)
      ....
      ....
      numpy_array = torch_array.numpy()

  4. Thank you for a great post! It really helped me understand how Tensor storage works. Now I can check whether two tensors share the same storage (with `t0.storage().data_ptr() == t1.storage().data_ptr()`), but how can I check whether a Numpy array is a view of a tensor? Is there a way to do a similar check between PyTorch and Numpy? Thanks in advance for your advice!

    1. You can use: n_array.__array_interface__['data']; however, this is just for illustration purposes, since comparing raw pointers is not a very good idea.
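      For illustration, a sketch of that (fragile) pointer comparison:

      >>> import numpy as np, torch
      >>> arr = np.ones(4)
      >>> t = torch.from_numpy(arr)
      >>> t.storage().data_ptr() == arr.__array_interface__['data'][0]
      True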

  5. Great post, it really helped me a lot to understand PyTorch storage.

    My understanding is that I can create a PyTorch tensor from a C++ STL vector and expose it to Python through pybind without a copy.

    I wonder if I can expose an STL vector from C++ to Python and create a tensor from it without making a copy, although https://pytorch.org/docs/stable/tensors.html says torch.tensor always copies data.
