PyTorch – Internal Architecture Tour

Update 28 Feb 2019: I added a new blog post with a slide deck containing the presentation I did for PyData Montreal.

Introduction

This post is a tour around the PyTorch codebase, and it is meant to be a guide for the architectural design of PyTorch and its internals. My main goal is to provide something useful for those who are interested in understanding what happens beyond the user-facing API, and to show something new beyond what is already covered in other tutorials.

Note: the PyTorch build system uses code generation extensively, so I won't repeat here what has already been described by others. If you're interested in understanding how this works, please read the following tutorials:

Short intro to Python extension objects in C/C++

As you probably know, you can extend Python using C and C++ and develop what is called an "extension". All of PyTorch's heavy lifting is implemented in C/C++ rather than pure Python. To define a new Python object type in C/C++, you define a structure like the example below (which is the base for the autograd Variable class):

// Python object that backs torch.autograd.Variable
struct THPVariable {
    PyObject_HEAD
    torch::autograd::Variable cdata;
    PyObject* backward_hooks;
};

As you can see, there is a macro at the beginning of the definition, called PyObject_HEAD. This macro's goal is the standardization of Python objects: it expands to another structure that contains a pointer to a type object (which defines initialization methods, allocators, etc.) and also a field with a reference counter.

There are two extra macros in the Python API called Py_INCREF() and Py_DECREF(), which are used to increment and decrement the reference counter of Python objects. Multiple entities can borrow or own references to other objects (the reference counter is increased), and only when this reference counter reaches zero (when all references get destroyed) will Python automatically delete the object's memory using its garbage collector.
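To see this reference counting from the Python side, here is a minimal sketch (CPython-specific; sys.getrefcount() reports one extra reference held by its own argument):

import sys

a = []                       # a fresh list object, referenced only by "a"
print(sys.getrefcount(a))    # typically 2: "a" plus the temporary argument reference

b = a                        # borrow another reference to the same object
print(sys.getrefcount(a))    # typically 3

del b                        # drop the borrowed reference again
print(sys.getrefcount(a))    # back to 2; at zero, the object's memory is released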

You can read more about Python C/C++ extensions here.

Fun fact: it is very common in many applications to use small integer numbers for indexing, counters, etc. For efficiency, the official CPython interpreter caches the integers from -5 up to 256. For that reason, the statement a = 200; b = 200; a is b will be True, while the statement a = 300; b = 300; a is b will be False.
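A quick sketch of this behavior, typed as separate statements in the interactive interpreter (this is a CPython implementation detail, not a language guarantee, and the result for the larger numbers can vary depending on how the code is compiled, e.g. script vs. REPL):

>>> a = 200
>>> b = 200
>>> a is b        # 200 is inside the cached [-5, 256] range, so it is the same object
True
>>> a = 300
>>> b = 300
>>> a is b        # outside the cached range, usually distinct objects
False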

Zero-copy PyTorch Tensor to Numpy and vice-versa

PyTorch has its own Tensor representation, which decouples the PyTorch internal representation from external representations. However, since it is very common, especially when data is loaded from a variety of sources, to have Numpy arrays everywhere, we really need to be able to convert between Numpy and PyTorch tensors. For that reason, PyTorch provides two methods called from_numpy() and numpy(), which convert a Numpy array to a PyTorch tensor and vice-versa, respectively. If we look at the code that is called to convert a Numpy array into a PyTorch tensor, we can get more insights into PyTorch's internal representation:

at::Tensor tensor_from_numpy(PyObject* obj) {
  if (!PyArray_Check(obj)) {
    throw TypeError("expected np.ndarray (got %s)", Py_TYPE(obj)->tp_name);
  }

  auto array = (PyArrayObject*)obj;
  int ndim = PyArray_NDIM(array);
  auto sizes = to_aten_shape(ndim, PyArray_DIMS(array));
  auto strides = to_aten_shape(ndim, PyArray_STRIDES(array));

  // NumPy strides use bytes. Torch strides use element counts.
  auto element_size_in_bytes = PyArray_ITEMSIZE(array);
  for (auto& stride : strides) {
    stride /= element_size_in_bytes;
  }

  // (...) - omitted for brevity

  void* data_ptr = PyArray_DATA(array);
  auto& type = CPU(dtype_to_aten(PyArray_TYPE(array)));
  Py_INCREF(obj);
  return type.tensorFromBlob(data_ptr, sizes, strides, [obj](void* data) {
    AutoGIL gil;
    Py_DECREF(obj);
  });
}

(Code from tensor_numpy.cpp)

As you can see from this code, PyTorch obtains all the information (array metadata) from the Numpy representation and then creates its own. However, as you can note from the call to PyArray_DATA(array), PyTorch gets a pointer to the raw data of the internal Numpy array instead of copying it. This means that PyTorch will create a reference to this data, sharing the same memory region with the Numpy array object for the raw Tensor data.
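A short Python sketch of this sharing behavior (not from the original code, just an illustration): since from_numpy() does not copy, an in-place change on the Numpy side is immediately visible through the Tensor:

import numpy as np
import torch

np_array = np.ones((2, 2), dtype=np.float32)
torch_tensor = torch.from_numpy(np_array)   # no copy: both objects share the same buffer

np_array[0, 0] = 42.0                       # mutate the Numpy array in place
print(torch_tensor[0, 0])                   # the change shows up on the Tensor side as well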

There is also an important point here: when the Numpy array object goes out of scope and gets a zero reference count, it will be garbage collected and destroyed; that's why there is an increment in the reference counting of the Numpy array object with the Py_INCREF(obj) call.

After this, PyTorch will create a new Tensor object from this Numpy data blob, and in the creation of this new Tensor it passes the borrowed memory data pointer, together with the memory sizes and strides, as well as a function that will be used later by the Tensor Storage (we'll discuss this in the next section) to release the data by decrementing the reference count of the Numpy array object and letting Python take care of this object's life cycle.
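Because of that borrowed reference, the buffer stays alive even if the Python name for the Numpy array goes away; a minimal sketch of this life cycle:

import numpy as np
import torch

np_array = np.arange(4, dtype=np.float32)
t = torch.from_numpy(np_array)

del np_array     # we drop our reference, but the Tensor still holds one (the Py_INCREF above)
print(t)         # the data is still valid; it will be released when the Tensor storage is freed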

The tensorFromBlob() method will create a new Tensor, but only after creating a new "Storage" for this Tensor. The storage is where the actual data pointer will be stored (and not in the Tensor structure itself). This takes us to the next section about Tensor Storages.

Tensor Storage

The actual raw data of the Tensor is not directly kept in the Tensor structure, but in another structure called Storage, which in turn is part of the Tensor structure.

As we saw in the previous code from tensor_from_numpy(), there is a call to tensorFromBlob() that will create a Tensor from the raw data blob. This last function will call another function called storageFromBlob() that will, in turn, create a storage for this data according to its type. In the case of a CPU float type, it will return a new CPUFloatStorage instance.

The CPUFloatStorage is basically a wrapper with utility functions around the actual storage structure called THFloatStorage that we show below:

typedef struct THStorage {
    real *data;
    ptrdiff_t size;
    int refcount;
    char flag;
    THAllocator *allocator;
    void *allocatorContext;
    struct THStorage *view;
} THStorage;

(Code from THStorage.h)

As you can see, THStorage holds a pointer to the raw data, its size, flags and also an interesting field called allocator that we'll discuss in a moment. It is also important to note that there is no metadata inside THStorage about how to interpret the data; this is due to the fact that the Storage is "dumb" regarding its contents, and it is the Tensor's responsibility to know how to "view" or interpret this data.
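A small Python illustration of the Storage being "dumb" about its contents: whatever the Tensor's shape, the user-facing .storage() accessor exposes just a flat, typed sequence of elements (a minimal sketch):

>>> import torch
>>> t = torch.arange(6.).view(2, 3)   # a 2x3 "view" over six elements
>>> t.storage().tolist()              # the storage itself is only a flat buffer
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
>>> len(t.storage())
6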

From this, you probably already realized that we can have multiple tensors pointing to the same storage but with different views of this data, and that's why viewing a tensor with a different shape (but keeping the same number of elements) is so efficient. The Python code below shows that the data pointer in the storage is being shared after changing the way the Tensor views its data:

>>> tensor_a = torch.ones((3, 3))
>>> tensor_b = tensor_a.view(9)
>>> tensor_a.storage().data_ptr() == tensor_b.storage().data_ptr()
True

As we can see in the example above, the data pointer on the storage of both Tensors is the same, but the Tensors represent a different interpretation of the storage data.
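And because both Tensors share the same storage, an in-place write through one view is immediately visible through the other (continuing the snippet above):

>>> tensor_b[0] = 42.0        # write through the flat 9-element view
>>> float(tensor_a[0, 0])     # visible through the original 3x3 view as well
42.0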

Now, as we saw in the THFloatStorage structure above, there is a pointer to a THAllocator structure (the allocator field). This is very important because it brings flexibility regarding the allocator that can be used to allocate the storage data. This structure is represented by the following code:

typedef struct THAllocator {
  void* (*malloc)(void*, ptrdiff_t);
  void* (*realloc)(void*, void*, ptrdiff_t);
  void (*free)(void*, void*);
} THAllocator;

(Code from THAllocator.h)

As you can see, there are three function pointer fields in this structure to define what an allocator means: a malloc, realloc and free. For CPU-allocated memory, these functions will, of course, relate to the traditional malloc/realloc/free POSIX functions; however, when we want a storage allocated on GPUs we'll end up using CUDA allocators such as cudaMallocHost(), as we can see in the THCudaHostAllocator malloc function below:

static void *THCudaHostAllocator_malloc(void* ctx, ptrdiff_t size) {
  void* ptr;

  if (size < 0) THError("Invalid memory size: %ld", size);

  if (size == 0) return NULL;

  THCudaCheck(cudaMallocHost(&ptr, size));

  return ptr;
}

(Code from THCAllocator.c)
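Although this is a C-level detail, the user-visible way of ending up with page-locked (pinned) memory from a CUDA host allocator like the one above is pin_memory(); a minimal Python sketch (it assumes a CUDA-enabled build of PyTorch):

import torch

t = torch.ones(1024)      # ordinary pageable CPU memory
p = t.pin_memory()        # copy into page-locked memory obtained from a CUDA host allocator
print(p.is_pinned())      # True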

You probably noticed a pattern in the repository organization, and it is important to keep these conventions in mind when navigating the repository, as summarized here (taken from the PyTorch lib readme):

  • TH = TorcH
  • THC = TorcH Cuda
  • THCS = TorcH Cuda Sparse
  • THCUNN = TorcH CUda Neural Network
  • THD = TorcH Distributed
  • THNN = TorcH Neural Network
  • THS = TorcH Sparse

This convention is also present in the function/class names and other objects, so it is important to always keep these patterns in mind. While you can find CPU allocators in the TH code, you'll find CUDA allocators in the THC code.

Finally, we can see the composition of the main Tensor structure, THTensor:

typedef struct THTensor {
    int64_t *size;
    int64_t *stride;
    int nDimension;
    THStorage *storage;
    ptrdiff_t storageOffset;
    int refcount;
    char flag;
} THTensor;

(Code from THTensor.h)

And as you can see, the main THTensor structure holds the size/strides/dimensions/offsets/etc. as well as the storage (THStorage) for the Tensor data.
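These same fields are exposed on the Python side, which is handy when poking at the metadata; a small sketch:

>>> import torch
>>> t = torch.ones((3, 4))
>>> t.size()                  # the "size" field
torch.Size([3, 4])
>>> t.stride()                # the "stride" field, in elements (not bytes)
(4, 1)
>>> t.storage_offset()        # offset into the underlying storage
0
>>> t[1:].storage_offset()    # a sliced view starts further into the same storage
4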

We can summarize all of this structure in the diagram below:

Now, once we have requirements such as multi-processing, where we want to share tensor data among multiple different processes, we need a shared memory approach to solve it. Otherwise, every time another process needs a tensor, or even when you want to implement a Hogwild training procedure where all the different processes write to the same memory region (where the parameters are), you'll need to make copies between processes, and this is very inefficient. Therefore we'll discuss in the next section a special kind of storage for Shared Memory.

Shared Memory

Shared memory can be implemented in many different ways depending on the platform support. PyTorch supports some of them, but for the sake of simplicity I'll talk here about what happens on macOS using the CPU (instead of GPU). Since PyTorch supports multiple shared memory approaches, this part is a little tricky to grasp, as it involves more levels of indirection in the code.

PyTorch provides a wrapper around the Python multiprocessing module, which can be imported from torch.multiprocessing. The changes implemented in this wrapper around the official Python multiprocessing were made to ensure that every time a tensor is put on a queue or shared with another process, PyTorch will make sure that only a handle for the shared memory is shared, instead of a whole new copy of the Tensor.

Now, many people aren't aware of a Tensor method from PyTorch called share_memory_(); however, this function is what triggers an entire rebuild of the storage memory for that particular Tensor. What this method does is create a region of shared memory that can be used among different processes. This function will, in the end, call the following function below:

static THStorage* THPStorage_(newFilenameStorage)(ptrdiff_t size)
{
  int flags = TH_ALLOCATOR_MAPPED_SHAREDMEM | TH_ALLOCATOR_MAPPED_EXCLUSIVE;
  std::string handle = THPStorage_(__newHandle)();
  auto ctx = libshm_context_new(NULL, handle.c_str(), flags);
  return THStorage_(newWithAllocator)(size, &THManagedSharedAllocator, (void*)ctx);
}

(Code from StorageSharing.cpp)

And as you can see, this function will create another storage using a special allocator called THManagedSharedAllocator. This function first defines some flags and then it creates a handle, which is a string in the format /torch_[process id]_[random number]; after that, it will create a new storage using the special THManagedSharedAllocator. This allocator has function pointers to an internal PyTorch library called libshm, which implements a Unix Domain Socket communication to share the handles of the shared memory regions. This allocator is really a special case: it is a kind of "smart allocator" because it contains the communication control logic, and it uses another allocator called THRefcountedMapAllocator that is responsible for creating the actual shared memory region and calling mmap() to map this region into the process virtual address space.

Note: when a method in PyTorch ends with an underscore, such as the method called share_memory_(), it means that this method has an in-place effect: it will change the current object instead of creating a new one with the modifications.

I'll now show a Python example of one process consuming the data of a Tensor that was allocated in another process, by manually exchanging the shared memory handle:

This is executed in process A:

>>> import torch
>>> tensor_a = torch.ones((5, 5))
>>> tensor_a

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

>>> tensor_a.is_shared()
False
>>> tensor_a = tensor_a.share_memory_()
>>> tensor_a.is_shared()
True
>>> tensor_a_storage = tensor_a.storage()
>>> tensor_a_storage._share_filename_()
(b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)

In this code, executed in process A, we create a new 5x5 Tensor filled with ones. After that we make it shared and print the tuple with the Unix Domain Socket address as well as the handle. Now we can access this memory region from another process B, as shown below:

Code executed in process B:

>>> import torch
>>> tensor_a = torch.Tensor()
>>> tuple_info = (b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)
>>> storage = torch.Storage._new_shared_filename(*tuple_info)
>>> tensor_a = torch.Tensor(storage).view((5, 5))

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

As you can see, using the tuple information with the Unix Domain Socket address and the handle, we are able to access the Tensor storage from another process. If you change this tensor in process B, you'll also see the change reflected in process A, because these Tensors share the same memory region.
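In practice you rarely exchange the handles by hand; the torch.multiprocessing wrapper does it for you whenever a Tensor is put on a queue. A minimal sketch of that pattern (the function and variable names here are just illustrative):

import torch
import torch.multiprocessing as mp

def worker(queue):
    shared = queue.get()    # only a shared-memory handle travels, not a copy of the data
    shared += 1.0           # in-place update, visible to the parent process

if __name__ == "__main__":
    mp.set_start_method("spawn")              # a portable choice of start method
    tensor = torch.zeros(5).share_memory_()   # move the storage to shared memory
    queue = mp.Queue()

    process = mp.Process(target=worker, args=(queue,))
    process.start()
    queue.put(tensor)
    process.join()

    print(tensor)    # reflects the child's in-place update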

DLPack: a hope for the Deep Learning frameworks Babel

Now I would like to talk about something recent in the PyTorch codebase, called DLPack. DLPack is an open standardization of an in-memory tensor structure that allows exchanging tensor data between frameworks, and what is quite interesting is that since this memory representation is standardized and very similar to the memory representation already in use by many frameworks, it allows zero-copy data sharing between frameworks, which is a quite amazing initiative given the variety of frameworks we have today without inter-communication among them.

This will certainly help to overcome the "island model" that we have today between tensor representations in MXNet, PyTorch, etc., and will allow developers to mix framework operations between frameworks and all the benefits that a standardization can bring to the frameworks.

The core of DLPack is a very simple structure called DLTensor, as shown below:

/*!
 * \brief Plain C Tensor object, does not manage memory.
 */
typedef struct {
  /*!
   * \brief The opaque data pointer points to the allocated data.
   *  This will be CUDA device pointer or cl_mem handle in OpenCL.
   *  This pointer is always aligns to 256 bytes as in CUDA.
   */
  void* data;
  /*! \brief The device context of the tensor */
  DLContext ctx;
  /*! \brief Number of dimensions */
  int ndim;
  /*! \brief The data type of the pointer*/
  DLDataType dtype;
  /*! \brief The shape of the tensor */
  int64_t* shape;
  /*!
   * \brief strides of the tensor,
   *  can be NULL, indicating tensor is compact.
   */
  int64_t* strides;
  /*! \brief The offset in bytes to the beginning pointer to data */
  uint64_t byte_offset;
} DLTensor;

(Code from dlpack.h)

As you can see, there is a data pointer for the raw data as well as shape/stride/offset/GPU vs CPU, and other metadata information about the data that the DLTensor points to.

There is also a managed version of the tensor called DLManagedTensor, where the frameworks can provide a context and also a "deleter" function that can be called by the framework that borrowed the Tensor to inform the other framework that the resources are no longer required.

In PyTorch, if you want to convert to or from a DLTensor format, you can find both C/C++ methods for doing that or even in Python you can do that as shown below:

import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
dl = dlpack.to_dlpack(t)

This Python function will call the toDLPack function from ATen, shown below:

DLManagedTensor* toDLPack(const Tensor& src) {
  ATenDLMTensor* atDLMTensor(new ATenDLMTensor);
  atDLMTensor->handle = src;
  atDLMTensor->tensor.manager_ctx = atDLMTensor;
  atDLMTensor->tensor.deleter = &deleter;
  atDLMTensor->tensor.dl_tensor.data = src.data_ptr();
  int64_t device_id = 0;
  if (src.type().is_cuda()) {
    device_id = src.get_device();
  }
  atDLMTensor->tensor.dl_tensor.ctx = getDLContext(src.type(), device_id);
  atDLMTensor->tensor.dl_tensor.ndim = src.dim();
  atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src.type());
  atDLMTensor->tensor.dl_tensor.shape = const_cast<int64_t*>(src.sizes().data());
  atDLMTensor->tensor.dl_tensor.strides = const_cast<int64_t*>(src.strides().data());
  atDLMTensor->tensor.dl_tensor.byte_offset = 0;
  return &(atDLMTensor->tensor);
}

As you can see, it’s a pretty simple conversion, casting the metadata from the PyTorch format to the DLPack format and assigning a pointer to the internal Tensor data representation.
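To close the loop, torch.utils.dlpack also provides from_dlpack(), so a zero-copy round trip looks like the sketch below (the rebuilt tensor shares memory with the original one; note that a DLPack capsule is meant to be consumed only once):

import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
capsule = dlpack.to_dlpack(t)       # wrap the tensor as a DLPack capsule
t2 = dlpack.from_dlpack(capsule)    # rebuild a tensor over the same memory

t2[0, 0] = 7.0
print(t[0, 0])                      # the change is visible through the original tensor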

I really hope that more frameworks adopt this standard, which will certainly give benefits to the ecosystem. It is also interesting to note that a potential integration with Apache Arrow would be amazing.

That's it, I hope you liked this long post!

– Christian S. Perone

Cite this article as: Christian S. Perone, "PyTorch – Internal Architecture Tour," in Terra Incognita, 12/03/2018, //www.cpetem.com/2018/03/pytorch-internal-architecture-tour/

13 thoughts to “PyTorch – Internal Architecture Tour”

  1. Great post! Very interesting to see the details of Pytorch and to know it is well implemented.

  2. Nice post! However, I think you'd better add the source code version, since the underlying backend is changing fast and some links are already broken.

  3. Hi Christian, thanks for the inside details on pytorch.

    I have a question on the conversion from pytorch to numpy and hope you can help me understand what is happening and how to fix it.

    Simply put, I convert the array to pytorch, perform a process, then convert back to numpy for subsequent processing using OpenCV.

    Example:
    torch_array = torch.from_numpy(numpy_array)  # less than 1 msec
    do processing on torch_array  # less than 1 msec, GPU @ 99%
    numpy_array = np.array(torch_array)  # greater than 200 msec

    GPU = nvidia on jetson TX1 platform
    torch = 0.4.0

    regards h

    1. You should use .numpy().

      torch_array = torch.from_numpy(numpy_array)
      ….
      ….
      numpy_array = torch_array.numpy()

  4. Well written! Now I know more about pytorch internals and how it represents/stores tensors.

  5. Thanks for a great post! It really helped me understand how tensor storage works. Now I can check whether two tensors share the same storage (with `t0.storage().data_ptr() == t1.storage().data_ptr()`), but how can I check whether a numpy array is a view of a tensor? Is there a way to do a similar check between PyTorch and numpy? Thanks in advance for your advice!

    1. You can use n_array.__array_interface__['data']; however, this is just for illustration purposes, since comparing raw pointers is not a very good idea.

  6. Great post, it did help me a lot understanding Pytorch storage.

    My understanding is that I can create a C++ pytorch tensor from a STL vector and expose it to Python through pybind without copies.

    I wonder if I could expose a STL vector from C++ into Python and create a tensor from it without making copies, despite the fact that https://pytorch.org/docs/stable/tensors.html says torch.tensor always copies data.
