PyTorch – Internal Architecture Tour

Update 28 Feb 2019: I added a new blog post with a slide deck containing the presentation I did for PyData Montreal.

Introduction

This post is a tour around the PyTorch codebase; it is intended to be a guide for the architectural design of PyTorch and its internals. My main goal is to provide something useful for those who are interested in understanding what happens beyond the user-facing API, and to show something new beyond what is already covered in other tutorials.

Note: the PyTorch build system uses code generation extensively, so I won’t repeat here what was already described by others. If you’re interested in understanding how this works, please read the following tutorials:

Short intro to Python extension objects in C/C++

As you probably know, you can extend Python using C and C++ and develop what is called an “extension”. All of PyTorch’s heavy work is implemented in C/C++ instead of pure Python. To define a new Python object type in C/C++, you define a structure like the example below (which is the base for the autograd Variable class):

// Python object that backs torch.autograd.Variable
struct THPVariable {
    PyObject_HEAD
    torch::autograd::Variable cdata;
    PyObject* backward_hooks;
};

As you can see, there is a macro at the beginning of the definition, called PyObject_HEAD. This macro’s goal is the standardization of Python objects, and it will expand to another structure that contains a pointer to a type object (which defines initialization methods, allocators, etc.) and also a field with the reference counter.

There are two extra macros in the Python API called Py_INCREF() and Py_DECREF(), which are used to increment and decrement the reference counter of Python objects. Multiple entities can borrow or own references to other objects (the reference counter is incremented), and only when this reference counter reaches zero (when all references are destroyed) will Python automatically delete the object’s memory using its garbage collector.
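Although Py_INCREF() and Py_DECREF() are C-level macros, you can watch the same reference counter from Python with sys.getrefcount(); a small illustrative sketch:

import sys

x = []                     # a fresh list object
print(sys.getrefcount(x))  # 2: the reference held by the name x, plus the temporary
                           # reference created by the getrefcount() call itself
y = x                      # another reference is taken -> counter is incremented
print(sys.getrefcount(x))  # 3
del y                      # the reference is destroyed -> counter is decremented
print(sys.getrefcount(x))  # 2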

You can read more about Python C/C++ extensions here.

Fun fact: it is very common in many applications to use small integers for indexing, counters, etc. For efficiency, the official CPython interpreter caches the integers from -5 up to 256. For that reason, the statement a = 200; b = 200; a is b will be True, while the statement a = 300; b = 300; a is b will be False.
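You can check this behavior yourself in an interactive CPython session; a quick sketch (note that the exact result for the non-cached case can vary, since the compiler may also reuse constants within a single compiled block):

a = 200
b = 200
print(a is b)   # True: 200 falls inside CPython's small-integer cache (-5..256)

a = 300
b = 300
print(a is b)   # typically False when typed line by line in the REPL, because
                # two distinct int objects are created for the value 300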

Zero-copy PyTorch Tensor to Numpy and vice-versa

PyTorch has its own Tensor representation, which decouples PyTorch’s internal representation from external representations. However, since it is very common, especially when data is loaded from a variety of sources, to have Numpy arrays everywhere, we really need to be able to convert between Numpy and PyTorch tensors. For that reason, PyTorch provides two methods called from_numpy() and numpy(), which convert a Numpy array to a PyTorch tensor and vice-versa, respectively. If we look at the code that is called to convert a Numpy array into a PyTorch tensor, we can get more insights on PyTorch’s internal representation:

at::Tensor tensor_from_numpy(PyObject* obj) {
  if (!PyArray_Check(obj)) {
    throw TypeError("expected np.ndarray (got %s)", Py_TYPE(obj)->tp_name);
  }
  auto array = (PyArrayObject*)obj;
  int ndim = PyArray_NDIM(array);
  auto sizes = to_aten_shape(ndim, PyArray_DIMS(array));
  auto strides = to_aten_shape(ndim, PyArray_STRIDES(array));
  // NumPy strides use bytes. Torch strides use element counts.
  auto element_size_in_bytes = PyArray_ITEMSIZE(array);
  for (auto& stride : strides) {
    stride /= element_size_in_bytes;
  }
  // (...) - omitted for brevity
  void* data_ptr = PyArray_DATA(array);
  auto& type = CPU(dtype_to_aten(PyArray_TYPE(array)));
  Py_INCREF(obj);
  return type.tensorFromBlob(data_ptr, sizes, strides, [obj](void* data) {
    AutoGIL gil;
    Py_DECREF(obj);
  });
}

(Code from tensor_numpy.cpp)

As you can see from this code, PyTorch is obtaining all the information from the Numpy representation (the array metadata) and then creating its own. However, as you can note from the call to PyArray_DATA(array), PyTorch is getting a pointer to the internal Numpy array raw data instead of copying it. This means that PyTorch will create a reference to this data, sharing the same memory region with the Numpy array object for the raw tensor data.

There is also an important point here: when the Numpy array object goes out of scope and gets a zero reference count, it will be garbage collected and destroyed; that’s why there is an increment in the reference count of the Numpy array object with the Py_INCREF(obj) call before the pointer is handed over.

After that, PyTorch will create a new Tensor object from this Numpy data blob, and in the creation of this new Tensor it passes the borrowed memory data pointer, together with the memory size and strides, as well as a function that will later be used by the Tensor Storage (we’ll discuss this in the next section) to release the data by decrementing the reference count of the Numpy array object and letting Python take care of this object’s life cycle.
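A small Python sketch of this zero-copy behavior: since the tensor and the Numpy array share the same memory region, a mutation through one object is visible through the other:

import numpy as np
import torch

np_array = np.ones((2, 2), dtype=np.float32)
torch_tensor = torch.from_numpy(np_array)        # borrows the Numpy buffer, no copy

np_array[0, 0] = 42.0
print(torch_tensor[0, 0])                        # 42.0 - the change is visible

back = torch_tensor.numpy()                      # also zero-copy, the same buffer again
print(back.__array_interface__['data'][0] ==
      np_array.__array_interface__['data'][0])   # True: same raw data pointer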

The tensorFromBlob() method will create a new Tensor, but only after creating a new “Storage” for this Tensor. The storage is where the actual data pointer will be stored (and not in the Tensor structure itself). This takes us to the next section about Tensor Storages.

Tensor Storage

The actual raw data of the Tensor is not directly kept in the Tensor structure, but in another structure called Storage, which is in turn part of the Tensor structure.

As we saw in the previous code from tensor_from_numpy(), there is a call to tensorFromBlob() that will create a Tensor from the raw data blob. This last function will call another function called storageFromBlob() that will, in turn, create a storage for this data according to its type. In the case of a CPU float type, it will return a new CPUFloatStorage instance.

The CPUFloatStorage is basically a wrapper with utility functions around the actual storage structure called THFloatStorage, which we show below:

typedef struct THStorage
{
    real *data;
    ptrdiff_t size;
    int refcount;
    char flag;
    THAllocator *allocator;
    void *allocatorContext;
    struct THStorage *view;
} THStorage;

(Code from THStorage.h)

As you can see, the THStorage holds a pointer to the raw data, its size, flags, and also an interesting field called allocator that we’ll discuss soon. It is also important to note that there is no metadata about how to interpret the data inside the THStorage; this is because the storage is “dumb” regarding its contents, and it is the Tensor’s responsibility to know how to “view” or interpret this data.

From this, you have probably already realized that we can have multiple tensors pointing to the same storage but with different views of this data, and that’s why viewing a tensor with a different shape (while keeping the same number of elements) is so efficient. The Python code below shows that the data pointer in the storage is shared after changing the way a Tensor views its data:

>>> tensor_a = torch.ones((3, 3))
>>> tensor_b = tensor_a.view(9)
>>> tensor_a.storage().data_ptr() == tensor_b.storage().data_ptr()
True

As we can see in the example above, the data pointer of the storage of both tensors is the same, but the tensors represent different interpretations of the storage data.

Now, as we saw in the THFloatStorage structure, there is a pointer to a THAllocator structure inside it. And this is very important because it brings flexibility regarding the allocator that can be used to allocate the storage data. This structure is represented by the following code:

typedef struct THAllocator
{
    void* (*malloc)(void*, ptrdiff_t);
    void* (*realloc)(void*, void*, ptrdiff_t);
    void (*free)(void*, void*);
} THAllocator;

(Code from THAllocator.h)

As you can see, there are three function pointer fields in this structure that define what an allocator means: a malloc, realloc, and free. For CPU-allocated memory, these functions will, of course, relate to the traditional malloc/realloc/free POSIX functions; however, when we want a storage allocated on GPUs we’ll end up using CUDA allocators such as cudaMallocHost(), as we can see in the THCudaHostAllocator malloc function below:

static void *THCudaHostAllocator_malloc(void* ctx, ptrdiff_t size) {
  void* ptr;

  if (size < 0) THError("Invalid memory size: %ld", size);

  if (size == 0) return NULL;

  THCudaCheck(cudaMallocHost(&ptr, size));

  return ptr;
}

(Code from THCAllocator.c)
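You normally don’t call these allocators directly from Python, but pinned (page-locked) host memory is one place where a CUDA host allocator like this one ends up being used under the hood; a minimal sketch, assuming a CUDA-enabled build and a visible GPU:

import torch

if torch.cuda.is_available():
    t = torch.ones((1000, 1000))
    pinned = t.pin_memory()      # host copy backed by page-locked (pinned) memory,
                                 # allocated through a CUDA host allocator instead
                                 # of a plain malloc()
    print(pinned.is_pinned())    # True
    # Pinned buffers enable faster (and optionally asynchronous) host-to-GPU copies:
    on_gpu = pinned.cuda(non_blocking=True)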

You have probably noticed a pattern in the repository organization, and it is important to keep these conventions in mind when navigating the repository; they are summarized here (taken from the PyTorch lib README):

  • TH = TorcH
  • THC = TorcH Cuda
  • THCS = TorcH Cuda Sparse
  • THCUNN = TorcH Cuda Neural Network
  • THD = TorcH Distributed
  • THNN = TorcH Neural Network
  • THS = TorcH Sparse

This convention is also present in the function/class names and other objects, so it is important to always keep these patterns in mind. While you can find CPU allocators in the TH code, you’ll find CUDA allocators in the THC code.

Finally, we can see the composition of the main Tensor structure, THTensor:

typedef struct THTensor
{
    int64_t *size;
    int64_t *stride;
    int nDimension;
    THStorage *storage;
    ptrdiff_t storageOffset;
    int refcount;
    char flag;
} THTensor;

(Code from THTensor.h)

And as you can see, the main THTensor structure holds the size/strides/dimensions/offsets/etc. as well as the storage (THStorage) for the tensor data.
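These fields are exposed on the Python side as well, so a quick sketch can make the role of each one concrete:

import torch

t = torch.arange(9.).view(3, 3)
print(t.size())            # torch.Size([3, 3])
print(t.stride())          # (3, 1): advancing one row skips 3 elements in storage
print(t.storage_offset())  # 0

# A slice is just new metadata (offset/size/stride) over the same storage:
row = t[1]
print(row.storage_offset())                                # 3
print(row.storage().data_ptr() == t.storage().data_ptr())  # True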

We can summarize all these structures in the diagram below:

Now, once we have requirements such as multi-processing, where we want to share tensor data among multiple different processes, we need a shared memory approach to solve it. Otherwise, every time another process needs a tensor, or even when you want to implement a Hogwild training procedure where all the different processes write to the same memory region (where the parameters are), you’ll need to make copies between processes, and this is very inefficient. Therefore, we’ll discuss in the next section a special kind of storage for Shared Memory.

Shared Memory

Shared memory can be implemented in many different ways depending on platform support. PyTorch supports some of them, but for the sake of simplicity I’ll talk here about what happens on macOS using the CPU (instead of the GPU). Since PyTorch supports multiple shared memory approaches, this part is a little tricky to grasp, as it involves more levels of indirection in the code.

PyTorch provides a wrapper around the Python multiprocessing module that can be imported from torch.multiprocessing. The changes they implemented in this wrapper around the official Python multiprocessing ensure that every time a tensor is put on a queue or shared with another process, PyTorch will make sure that only a handle for the shared memory is shared, instead of a new entire copy of the Tensor.
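A minimal sketch of that behavior, using the share_memory_() method discussed next (platform details such as the multiprocessing start method are glossed over here):

import torch
import torch.multiprocessing as mp

def worker(shared_tensor):
    # The child process receives only a handle to the shared memory region,
    # so this in-place update is visible to the parent as well.
    shared_tensor += 1

if __name__ == '__main__':
    tensor = torch.zeros(5)
    tensor.share_memory_()     # move the underlying storage to shared memory
    p = mp.Process(target=worker, args=(tensor,))
    p.start()
    p.join()
    print(tensor)              # all elements are now 1, updated by the child process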

Now, many people aren’t aware of the Tensor method from PyTorch called share_memory_(); however, this function is what triggers an entire rebuild of the storage memory for that particular Tensor. What this method does is create a region of shared memory that can be used among different processes. In the end, this function will call the following function:

static THStorage* THPStorage_(newFilenameStorage)(ptrdiff_t size)
{
  int flags = TH_ALLOCATOR_MAPPED_SHAREDMEM | TH_ALLOCATOR_MAPPED_EXCLUSIVE;
  std::string handle = THPStorage_(__newHandle)();
  auto ctx = libshm_context_new(NULL, handle.c_str(), flags);
  return THStorage_(newWithAllocator)(size, &THManagedSharedAllocator, (void*)ctx);
}

(Code from StorageSharing.cpp)

As you can see, this function will create another storage using a special allocator called THManagedSharedAllocator. The function first defines some flags, then it creates a handle, which is a string in the format /torch_[process id]_[random number], and after that it will create a new storage using the special THManagedSharedAllocator. This allocator has function pointers to an internal PyTorch library called libshm, which will implement a Unix Domain Socket communication to share the handles of the shared memory regions. This allocator is actually a special case, a kind of “smart allocator”, because it contains the communication control logic and it uses another allocator called THRefcountedMapAllocator, which will be responsible for creating the actual shared memory region and calling mmap() to map this region to the process virtual address space.

Note: when a method ends with an underscore in PyTorch, such as the method called share_memory_(), it means that this method has an in-place effect, and it will change the current object instead of creating a new one with the modifications.
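For example, a quick illustration of the convention with add() versus add_():

import torch

t = torch.ones(3)
u = t.add(1)     # out-of-place: returns a new tensor and leaves t untouched
print(t)         # still all ones

t.add_(1)        # in-place: the trailing underscore means t itself is modified
print(t)         # now all twos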

I’ll now show a Python example of one process using the data from a Tensor that was allocated on another process, by manually exchanging the shared memory handle:

This is executed in process A:

>>> import torch
>>> tensor_a = torch.ones((5, 5))
>>> tensor_a

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

>>> tensor_a.is_shared()
False
>>> tensor_a = tensor_a.share_memory_()
>>> tensor_a.is_shared()
True
>>> tensor_a_storage = tensor_a.storage()
>>> tensor_a_storage._share_filename_()
(b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)

In this code, executed in process A, we create a new Tensor of 5x5 filled with ones. After that, we make it shared and print the tuple with the Unix Domain Socket address as well as the handle. Now we can access this memory region from another process B, as shown below:

Code executed in process B:

>>> import torch
>>> tensor_a = torch.Tensor()
>>> tuple_info = (b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)
>>> storage = torch.Storage._new_shared_filename(*tuple_info)
>>> tensor_a = torch.Tensor(storage).view((5, 5))

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

As you can see, using the tuple information about the Unix Domain Socket address and the handle, we were able to access the Tensor storage from another process. If you change this tensor in process B, you’ll also see the change reflected in process A, because these tensors share the same memory region.

DLPack: a hope for the Deep Learning frameworks Babel

Now I would like to talk about something recent in the PyTorch codebase, called DLPack. DLPack is an open standardization of an in-memory tensor structure that allows the exchange of tensor data between frameworks, and what is quite interesting is that since this memory representation is standardized and very similar to the memory representation already in use by many frameworks, it allows zero-copy data sharing between frameworks, which is a quite amazing initiative given the variety of frameworks we have today without inter-communication among them.

This will certainly help to overcome the “island model” that we have today between tensor representations in MXNet, PyTorch, etc., and will allow developers to mix framework operations between frameworks and all the benefits that a standardization can bring to the frameworks.

The core of DLPack is a very simple structure called DLTensor, as shown below:

/*!
 * \brief Plain C Tensor object, does not manage memory.
 */
typedef struct {
  /*!
   * \brief The opaque data pointer points to the allocated data.
   *  This will be CUDA device pointer or cl_mem handle in OpenCL.
   *  This pointer is always aligned to 256 bytes as in CUDA.
   */
  void* data;
  /*! \brief The device context of the tensor */
  DLContext ctx;
  /*! \brief Number of dimensions */
  int ndim;
  /*! \brief The data type of the pointer */
  DLDataType dtype;
  /*! \brief The shape of the tensor */
  int64_t* shape;
  /*!
   * \brief strides of the tensor,
   *  can be NULL, indicating tensor is compact.
   */
  int64_t* strides;
  /*! \brief The offset in bytes to the beginning pointer to data */
  uint64_t byte_offset;
} DLTensor;

(Code from dlpack.h)

As you can see, there is a data pointer for the raw data as well as shape/stride/offset/GPU vs CPU, and other metadata information about the data that the DLTensor is pointing to.

There is also a managed version of the tensor called DLManagedTensor, where the frameworks can provide a context and also a “deleter” function that can be called by the framework that borrowed the Tensor to inform the other framework that the resources are no longer required.

In PyTorch, if you want to convert to or from a DLTensor format, you can find both C/C++ methods for doing that, or you can do it in Python as shown below:

import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
dl = dlpack.to_dlpack(t)

This Python function will call the toDLPack function from ATen, shown below:

DLManagedTensor* toDLPack(const Tensor& src) {
  ATenDLMTensor* atDLMTensor(new ATenDLMTensor);
  atDLMTensor->handle = src;
  atDLMTensor->tensor.manager_ctx = atDLMTensor;
  atDLMTensor->tensor.deleter = &deleter;
  atDLMTensor->tensor.dl_tensor.data = src.data_ptr();
  int64_t device_id = 0;
  if (src.type().is_cuda()) {
    device_id = src.get_device();
  }
  atDLMTensor->tensor.dl_tensor.ctx = getDLContext(src.type(), device_id);
  atDLMTensor->tensor.dl_tensor.ndim = src.dim();
  atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src.type());
  atDLMTensor->tensor.dl_tensor.shape = const_cast<int64_t*>(src.sizes().data());
  atDLMTensor->tensor.dl_tensor.strides = const_cast<int64_t*>(src.strides().data());
  atDLMTensor->tensor.dl_tensor.byte_offset = 0;
  return &(atDLMTensor->tensor);
}

As you can see, it’s a pretty simple conversion, casting the metadata from the PyTorch format to the DLPack format and assigning a pointer to the internal Tensor data representation.
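To see that no data copy is involved, here is a small round-trip sketch using from_dlpack(), which converts a DLPack capsule back into a PyTorch tensor:

import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
capsule = dlpack.to_dlpack(t)         # wrap the tensor metadata in a DLPack capsule
t2 = dlpack.from_dlpack(capsule)      # import it back as a PyTorch tensor

# Both tensors point to the same underlying memory, so an in-place
# change through one of them is visible through the other.
print(t.data_ptr() == t2.data_ptr())  # True
t2[0, 0] = 42.0
print(t[0, 0])                        # 42.0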

I really hope that more frameworks adopt this standard, which will certainly benefit the whole ecosystem. It is also interesting to note that a potential integration with Apache Arrow would be amazing.

That’s it, I hope you liked this long post!

– Christian S. Perone

Cite this article as: Christian S. Perone, "PyTorch – Internal Architecture Tour," in Terra Incognita, 12/03/2018, //www.cpetem.com/2018/03/pytorch-internal-architecture-tour/

13 thoughts on “PyTorch – Internal Architecture Tour”

  1. Great post! Very interesting to see the details of Pytorch and to know that it is well implemented.

  2. Nice post! However, I think you’d better mention the source code version, because the underlying backend is changing rapidly and some of the links are already broken.

  3. Hi Christian, thanks for the inside details on pytorch.

    I have a question about the conversion from pytorch to numpy and hope you can help me understand what is happening and how to fix it.

    Simply put, I convert an array to pytorch, perform a process, then convert back to numpy for subsequent processing using OpenCV.

    Example:
    torch_array = torch.from_numpy(numpy_array)  # less than 1 ms
    do processing on torch_array  # processing takes less than 1 ms, with the GPU @ 99%
    numpy_array = np.array(torch_array)  # greater than 200 ms

    GPU = nvidia on jetson TX1 platform
    torch = 0.4.0

    Regards, H

    1. You should use .numpy() instead.

      torch_array = torch.from_numpy(numpy_array)
      ...
      ...
      numpy_array = torch_array.numpy()

  4. Well written! Now I know more about pytorch internals and how it represents/stores tensors.

  5. Thanks for a great post! It really helped me understand how tensor storage works. Now I can check whether two tensors share the same storage (with `t0.storage().data_ptr() == t1.storage().data_ptr()`), but how can I check whether a numpy array is a view of a tensor? Is there a way to do a similar check between PyTorch and numpy? Thanks for your advice in advance!

    1. You can use: n_array.__array_interface__['data']; however, this is just for illustration purposes, since comparing raw pointers is not a very good idea.

  6. Great post, it really helped me a lot in understanding Pytorch storage.

    My understanding is that I can create a pytorch tensor from a C++ STL vector and expose it to Python through pybind without copies.

    I wonder if I could expose an STL vector from C++ into Python and create a tensor from it without making copies, despite the fact that https://pytorch.org/docs/stable/tensors.html says torch.tensor always copies data.
