PyTorch – Internal Architecture Tour

Update 28 Feb 2019: I added a new blog post with a slide deck containing the presentation I did for PyData Montreal.


This post is a tour around the PyTorch codebase. It is meant to be a guide to the architectural design of PyTorch and its internals. My main goal is to provide something useful for those who are interested in understanding what happens beyond the user-facing API, and to show something new beyond what was already covered in other tutorials.

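Note: The PyTorch build system uses code generation extensively, so I won't repeat here what others have already described. If you are interested in understanding how this works, please read the tutorials dedicated to it.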

Short intro to Python extension objects in C/C++

As you probably know, you can extend Python using C and C++ and develop what is called an "extension". All the PyTorch heavy work is implemented in C/C++ instead of pure Python. To define a new Python object type in C/C++, you define a structure like the example below (which is the base for the autograd Variable class):

    // Python object that backs torch.autograd.Variable
    struct THPVariable {
        PyObject_HEAD
        torch::autograd::Variable cdata;
        PyObject* backward_hooks;
    };

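As you can see, there is a macro at the beginning of the definition, called PyObject_HEAD. The goal of this macro is to standardize Python objects: it expands to another structure that contains a pointer to a type object (which defines initialization methods, allocators, etc.) and a field with a reference counter.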

There are two extra macros in the Python API, called Py_INCREF() and Py_DECREF(), which are used to increment and decrement the reference counter of Python objects. Multiple entities can borrow or own a reference to other objects (owning a reference increases the counter), and only when this reference counter reaches zero (when all references get destroyed) will Python automatically release the memory of that object.
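The counter that these macros manipulate can also be observed from Python itself. The snippet below is only a small sketch using the standard library; the Payload class and the refcount_demo() function are names made up for this illustration, and the behaviour shown is specific to CPython.

```python
import sys
import weakref


class Payload:
    """Stand-in for any reference-counted Python object."""


def refcount_demo():
    obj = Payload()
    base = sys.getrefcount(obj)

    holders = [obj]                      # the list now owns one more reference
    gained = sys.getrefcount(obj) - base

    holders.clear()                      # the list gives its reference back
    restored = sys.getrefcount(obj) - base

    probe = weakref.ref(obj)             # a weak reference does not own the object
    del obj                              # drop the last owning reference
    freed = probe() is None              # CPython frees the object right away
    return gained, restored, freed


print(refcount_demo())
```

On CPython this should report one extra reference while the list holds the object, none after the list is cleared, and an object that is gone as soon as the last owner disappears.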

You can read more about Python C/C++ extensions here.

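Funny fact: it is very common in many applications to use small integer numbers as indexes, counters, etc. For efficiency, the official CPython interpreter caches the integers from -5 up to 256. For that reason, the statement a = 200; b = 200; a is b will be True, while the statement a = 300; b = 300; a is b will usually be False when the assignments are executed as separate statements (constants compiled together may be merged into a single object).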
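If you want to try this yourself, here is a small sketch. The same_object() helper is a hypothetical name, and it parses the numbers at run time so that the compiler cannot merge identical constants. I also picked a value far above the cached range instead of 300, so the outcome does not depend on the exact upper bound of the cache in a given CPython version.

```python
def same_object(text):
    """Parse the same number twice at run time and compare identities."""
    first = int(text)
    second = int(text)
    return first is second


print(same_object("200"))      # inside the range cached by CPython
print(same_object("1000000"))  # far outside the cached range
```

This is an implementation detail of CPython, so it should never be relied on for comparing values; use == for that.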

Zero-copy PyTorch Tensor to Numpy and vice-versa

PyTorch has its own Tensor representation, which decouples the PyTorch internal representation from external representations. However, Numpy arrays are everywhere, especially when data is loaded from a variety of sources, so we really need conversions between Numpy arrays and PyTorch tensors. For that reason, PyTorch provides two methods called from_numpy() and numpy(), which convert a Numpy array to a PyTorch tensor and vice-versa, respectively. If we look at the code that is called to convert a Numpy array into a PyTorch tensor, we can get more insight into PyTorch's internal representation:

    at::Tensor tensor_from_numpy(PyObject* obj) {
      if (!PyArray_Check(obj)) {
        throw TypeError("expected np.ndarray (got %s)", Py_TYPE(obj)->tp_name);
      }

      auto array = (PyArrayObject*)obj;
      int ndim = PyArray_NDIM(array);
      auto sizes = to_aten_shape(ndim, PyArray_DIMS(array));
      auto strides = to_aten_shape(ndim, PyArray_STRIDES(array));
      // NumPy strides use bytes. Torch strides use element counts.
      auto element_size_in_bytes = PyArray_ITEMSIZE(array);
      for (auto& stride : strides) {
        stride /= element_size_in_bytes;
      }

      // (...) - omitted for brevity

      void* data_ptr = PyArray_DATA(array);
      auto& type = CPU(dtype_to_aten(PyArray_TYPE(array)));
      Py_INCREF(obj);
      return type.tensorFromBlob(data_ptr, sizes, strides, [obj](void* data) {
        AutoGIL gil;
        Py_DECREF(obj);
      });
    }

(code from tensor_numpy.cpp)

As you can see from this code, PyTorch obtains all the information (array metadata) from the Numpy representation and then creates its own. However, as you can see from the line that calls PyArray_DATA(), PyTorch gets a pointer to the raw data of the Numpy array instead of copying it. This means that PyTorch will create a reference to this data, sharing the same memory region with the Numpy array object for the raw Tensor data.
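The same idea of borrowing a buffer instead of copying it can be tried with nothing but the Python standard library. This is an analogy rather than the PyTorch code path: array.array plays the role of the Numpy array that owns the data, memoryview plays the role of the object that borrows the pointer, and zero_copy_demo() is a name invented for this sketch.

```python
from array import array


def zero_copy_demo():
    source = array("f", [1.0, 2.0, 3.0, 4.0])  # owns the raw buffer
    shared = memoryview(source)                # borrows the same buffer, no copy
    copied = array("f", source)                # independent copy for contrast

    shared[0] = 10.0                           # write through the borrowed view
    source[1] = 20.0                           # write through the owner

    result = (source[0], shared[1], copied[0], copied[1])
    shared.release()                           # hand the buffer back to its owner
    return result


print(zero_copy_demo())
```

Writes made through either object are visible through the other one, while the explicit copy keeps its original values.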


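After this, PyTorch will create a new Tensor object from this Numpy data blob. When creating this new Tensor, it passes the borrowed memory data pointer, together with the memory size and strides, as well as a function that will be used later by the Tensor Storage (we'll discuss this in the next section) to release the data by decrementing the reference count of the Numpy array object and letting Python take care of this object's life cycle.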
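The strides handed over to the new Tensor are expressed in elements, while Numpy reports them in bytes, which is why tensor_from_numpy() divides each stride by the item size. A minimal sketch of that conversion in plain Python follows; both function names are hypothetical helpers and not part of PyTorch or Numpy.

```python
def contiguous_byte_strides(shape, itemsize):
    """Byte strides of a row-major (C-contiguous) array, as Numpy reports them."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def byte_strides_to_element_strides(byte_strides, itemsize):
    """Convert byte strides into element-count strides."""
    if itemsize <= 0:
        raise ValueError("itemsize must be positive")
    element_strides = []
    for stride in byte_strides:
        if stride % itemsize:
            raise ValueError("stride is not a multiple of the item size")
        element_strides.append(stride // itemsize)
    return tuple(element_strides)


byte_strides = contiguous_byte_strides((3, 4), itemsize=4)   # 3x4 array of 32-bit floats
print(byte_strides, byte_strides_to_element_strides(byte_strides, 4))
```

For a 3x4 array of 32-bit floats, moving one row forward skips 16 bytes, which is 4 elements.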

The tensorFromBlob() method will create a new Tensor, but only after creating a new "Storage" for this Tensor. The storage is where the actual data pointer will be stored (and not in the Tensor structure itself). This takes us to the next section about Tensor Storage.

Tensor Storage


As we saw in the previous code from tensor_from_numpy(), there is a call to tensorFromBlob() that will create a Tensor from the raw data blob. This last function will call another function called storageFromBlob() that will, in turn, create a storage for this data according to its type. In the case of a CPU float type, it will return a new CPUFloatStorage instance.

The CPUFloatStorage is basically a wrapper with utility functions around the actual storage structure called THFloatStorage that we show below:

    typedef struct THStorage
    {
        real *data;
        ptrdiff_t size;
        int refcount;
        char flag;
        THAllocator *allocator;
        void *allocatorContext;
        struct THStorage *view;
    } THStorage;

(code from THStorage.h)

As you can see, the THStorage holds a pointer to the raw data, its size, flags, and also an interesting field called allocator that we'll soon discuss. It is also important to note that there is no metadata on how to interpret the data inside the THStorage. This is because the storage is "dumb" regarding its contents, and it is the Tensor's responsibility to know how to "view" or interpret this data.


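Two tensors with different shapes can therefore share the same storage:

    >>> tensor_a = torch.ones((3, 3))
    >>> tensor_b = tensor_a.view(9)
    >>> tensor_a.storage().data_ptr() == tensor_b.storage().data_ptr()
    True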


Now, as we saw in the allocator field of the THFloatStorage structure, there is a pointer to a THAllocator structure there. This is very important because it brings flexibility regarding the allocator that can be used to allocate the storage data. This structure is represented by the following code:

    typedef struct THAllocator
    {
        void* (*malloc)(void*, ptrdiff_t);
        void* (*realloc)(void*, void*, ptrdiff_t);
        void (*free)(void*, void*);
    } THAllocator;

(code from THAllocator.h)

As you can see, there are three function pointer fields in this structure to define what an allocator means: a malloc, a realloc and a free. For CPU-allocated memory, these functions will, of course, relate to the traditional malloc/realloc/free POSIX functions. However, when we want a storage allocated on GPUs we'll end up using the CUDA allocators. A related example is cudaMallocHost(), which allocates page-locked host memory, as we can see in the THCudaHostAllocator malloc function below:

    static void *THCudaHostAllocator_malloc(void* ctx, ptrdiff_t size) {
      void* ptr;
      if (size < 0) THError("Invalid memory size: %ld", size);
      if (size == 0) return NULL;
      THCudaCheck(cudaMallocHost(&ptr, size));
      return ptr;
    }

(code from THCAllocator.c)
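To make the role of these three function pointers more concrete, here is a rough Python model of a storage that delegates its memory management to a pluggable allocator. All the names in it (Allocator, Storage, HEAP_ALLOCATOR, TRACKING_ALLOCATOR) are invented for this sketch; only the malloc/realloc/free trio and the context argument mirror the THAllocator structure shown above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Allocator:
    """Three callables, mirroring the malloc/realloc/free fields of THAllocator."""
    malloc: Callable[[Any, int], Optional[bytearray]]
    realloc: Callable[[Any, Optional[bytearray], int], Optional[bytearray]]
    free: Callable[[Any, Optional[bytearray]], None]


def heap_malloc(ctx, size):
    if size < 0:
        raise ValueError("Invalid memory size: %d" % size)
    return bytearray(size) if size else None


def heap_realloc(ctx, buf, size):
    new_buf = heap_malloc(ctx, size)
    if new_buf is not None and buf is not None:
        keep = min(len(buf), size)
        new_buf[:keep] = buf[:keep]      # preserve the old contents
    return new_buf


def heap_free(ctx, buf):
    pass                                 # Python reclaims the bytearray by itself


def tracking_malloc(ctx, size):
    ctx.append(("malloc", size))         # ctx plays the role of allocatorContext
    return heap_malloc(ctx, size)


def tracking_free(ctx, buf):
    ctx.append(("free", 0 if buf is None else len(buf)))


HEAP_ALLOCATOR = Allocator(heap_malloc, heap_realloc, heap_free)
TRACKING_ALLOCATOR = Allocator(tracking_malloc, heap_realloc, tracking_free)


class Storage:
    """Owns raw bytes but never decides how they are allocated."""

    def __init__(self, size, allocator=HEAP_ALLOCATOR, ctx=None):
        self.allocator = allocator
        self.ctx = ctx
        self.data = allocator.malloc(ctx, size)

    def release(self):
        self.allocator.free(self.ctx, self.data)
        self.data = None


log = []
storage = Storage(16, TRACKING_ALLOCATOR, log)
storage.release()
print(log)
```

The storage code stays the same no matter which allocator is plugged in, which is the flexibility that the allocator pointer brings.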

You probably noticed a pattern in the repository organization. It is important to keep these conventions in mind when navigating the repository, as summarized here (taken from the PyTorch lib readme):

  • TH = TorcH
  • THC = TorcH Cuda
  • THCS = TorcH Cuda Sparse
  • THCUNN = TorcH CUda Neural Network
  • THD = TorcH Distributed
  • THNN = TorcH Neural Network
  • THS = TorcH Sparse



Finally, we can see the composition of the main THTensor structure:

    typedef struct THTensor
    {
        int64_t *size;
        int64_t *stride;
        int nDimension;
        THStorage *storage;
        ptrdiff_t storageOffset;
        int refcount;
        char flag;
    } THTensor;


As you can see, the main THTensor structure holds the size/strides/dimensions/offsets/etc. as well as the storage (THStorage) for the Tensor data.
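The split between a "dumb" storage and a tensor that only carries metadata can be modelled in a few lines of Python. This is a toy sketch and not how PyTorch implements it: FlatStorage, StridedTensor and contiguous_strides() are hypothetical names, and only the fields (size, stride, storage offset, storage) follow the THTensor structure above.

```python
from array import array


def contiguous_strides(size):
    """Element strides of a row-major (C-contiguous) layout."""
    strides = []
    step = 1
    for dim in reversed(size):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def count_elements(size):
    total = 1
    for dim in size:
        total *= dim
    return total


class FlatStorage:
    """Owns a flat run of floats and knows nothing about shapes."""

    def __init__(self, values):
        self.data = array("f", values)


class StridedTensor:
    """Metadata (size, stride, offset) plus a reference to a storage."""

    def __init__(self, storage, size, stride=None, storage_offset=0):
        self.storage = storage
        self.size = tuple(size)
        self.stride = contiguous_strides(self.size) if stride is None else tuple(stride)
        self.storage_offset = storage_offset

    def _flat_index(self, index):
        if not isinstance(index, tuple):
            index = (index,)
        if len(index) != len(self.size):
            raise IndexError("expected %d indices" % len(self.size))
        flat = self.storage_offset
        for i, dim, step in zip(index, self.size, self.stride):
            if not 0 <= i < dim:
                raise IndexError("index out of range")
            flat += i * step
        return flat

    def __getitem__(self, index):
        return self.storage.data[self._flat_index(index)]

    def __setitem__(self, index, value):
        self.storage.data[self._flat_index(index)] = value

    def view(self, *size):
        """Reinterpret the same storage with a new shape (contiguous tensors only)."""
        if self.stride != contiguous_strides(self.size):
            raise ValueError("this sketch only views contiguous tensors")
        if count_elements(size) != count_elements(self.size):
            raise ValueError("a view must keep the number of elements")
        return StridedTensor(self.storage, size, None, self.storage_offset)


tensor_a = StridedTensor(FlatStorage(range(9)), (3, 3))
tensor_b = tensor_a.view(9)
tensor_b[4] = 42.0                      # write through the flat view
print(tensor_a[1, 1], tensor_a.storage is tensor_b.storage)
```

Both tensors read and write the same nine floats; only the size and stride metadata differ. A transposed tensor can be built the same way by swapping the strides, again without touching the storage.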

We can summarize all this structure that we saw in the diagram below:


Shared Memory

Shared memory can be implemented in many different ways depending on the platform support. PyTorch supports some of them, but for the sake of simplicity, I'll talk here about what happens on MacOS using the CPU (instead of GPU). Since PyTorch supports multiple shared memory approaches, this part is a little tricky to grasp because it involves more levels of indirection in the code.

PyTorch provides a wrapper around the Python multiprocessing module, which can be imported from torch.multiprocessing. The changes implemented in this wrapper around the official Python multiprocessing module make sure that every time a tensor is put on a queue or shared with another process, PyTorch shares only a handle for the shared memory instead of a whole new copy of the Tensor.

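Now, many people aren't aware of a Tensor method from PyTorch called share_memory_(). However, this function is what triggers an entire rebuild of the storage memory for that particular Tensor. What this method does is create a region of shared memory that can be used among different processes. This function will, in the end, call the following function below: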

    static THStorage* THPStorage_(newFilenameStorage)(ptrdiff_t size)
    {
      int flags = TH_ALLOCATOR_MAPPED_SHAREDMEM | TH_ALLOCATOR_MAPPED_EXCLUSIVE;
      std::string handle = THPStorage_(__newHandle)();
      auto ctx = libshm_context_new(NULL, handle.c_str(), flags);
      return THStorage_(newWithAllocator)(size, &THManagedSharedAllocator, (void*)ctx);
    }


As you can see, this function will create another storage using a special allocator called THManagedSharedAllocator. This function first defines some flags, then creates a handle, which is a string in the format /torch_[process id]_[random number], and after that it creates a new storage using the special THManagedSharedAllocator. This allocator has function pointers to an internal PyTorch library called libshm, which implements Unix Domain Socket communication to share the shared memory region handles. This allocator is actually a special case, a kind of "smart allocator", because it contains the communication control logic and also uses another allocator called THRefcountedMapAllocator, which is responsible for creating the actual shared memory region and calling mmap() to map this region into the process virtual address space.

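Note: when a method ends with an underscore in PyTorch, such as the method called share_memory_(), it means that this method has an in-place effect: it will change the current object instead of creating a new one with the modifications.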

I'll now show a Python example of one process using the data from a Tensor that was allocated in another process by manually exchanging the shared memory handle:

This is executed in the process A:

    >>> import torch
    >>> tensor_a = torch.ones((5, 5))
    >>> tensor_a

     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
    [torch.FloatTensor of size 5x5]

    >>> tensor_a.is_shared()
    False
    >>> tensor_a = tensor_a.share_memory_()
    >>> tensor_a.is_shared()
    True
    >>> tensor_a_storage = tensor_a.storage()
    >>> tensor_a_storage._share_filename_()
    (b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)

In this code, executed in process A, we create a new Tensor of 5×5 filled with ones. After that we make it shared and print the tuple with the Unix Domain Socket address as well as the handle. Now we can access this memory region from another process B as shown below:

Code executed in the process B:

    >>> import torch
    >>> tensor_a = torch.Tensor()
    >>> tuple_info = (b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)
    >>> storage = torch.Storage._new_shared_filename(*tuple_info)
    >>> tensor_a = torch.Tensor(storage).view((5, 5))

     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
    [torch.FloatTensor of size 5x5]

As you can see, using the tuple information about the Unix Domain Socket address and the handle, we were able to access the Tensor storage from another process. If you change the tensor in this process B, you'll also see the change reflected in process A, because these Tensors share the same memory region.
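The underlying mechanism, mapping one named region into more than one place with mmap(), can be sketched with the Python standard library alone. This is a simplified stand-in and not what libshm does: a temporary file plays the role of the shared region, the (path, count) tuple plays the role of the handle, and for brevity both mappings live in a single process, although a second process could attach using the same handle. All function names are hypothetical.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.calcsize("f")  # bytes per float element


def create_shared_region(values):
    """'Process A' side: back the data with a file and map it into memory."""
    payload = struct.pack("%df" % len(values), *values)
    fd, path = tempfile.mkstemp(prefix="toy_shm_")
    with os.fdopen(fd, "r+b") as backing:
        backing.write(payload)
        backing.flush()
        region = mmap.mmap(backing.fileno(), len(payload))
    return region, (path, len(values))   # the handle is what gets exchanged


def attach_shared_region(handle):
    """'Process B' side: map the same file using only the handle."""
    path, count = handle
    with open(path, "r+b") as backing:
        return mmap.mmap(backing.fileno(), count * ITEM)


def shared_memory_demo():
    region_a, handle = create_shared_region([1.0] * 25)
    region_b = attach_shared_region(handle)
    try:
        struct.pack_into("f", region_b, 0, 7.0)         # write through B's mapping
        return struct.unpack_from("f", region_a, 0)[0]  # read through A's mapping
    finally:
        region_a.close()
        region_b.close()
        os.remove(handle[0])


print(shared_memory_demo())
```

The value written through the second mapping is the one read back through the first, because both map the same region instead of holding separate copies.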

DLPack: a hope for the Deep Learning frameworks Babel

Now I would like to talk about something recent in the PyTorch code base, which is called DLPack. DLPack is an open standardization of an in-memory tensor structure that allows the exchange of tensor data between frameworks. What is quite interesting is that, since this memory representation is standardized and very similar to the memory representation already in use by many frameworks, it allows zero-copy data sharing between frameworks, which is a quite amazing initiative given the variety of frameworks we have today without inter-communication among them.


The core of DLPack is a very simple structure called DLTensor, as shown below:

    /*!
     * \brief Plain C Tensor object, does not manage memory.
     */
    typedef struct {
      /*!
       * \brief The opaque data pointer points to the allocated data.
       *  This will be CUDA device pointer or cl_mem handle in OpenCL.
       *  This pointer is always aligned to 256 bytes as in CUDA.
       */
      void* data;
      /*! \brief The device context of the tensor */
      DLContext ctx;
      /*! \brief Number of dimensions */
      int ndim;
      /*! \brief The data type of the pointer */
      DLDataType dtype;
      /*! \brief The shape of the tensor */
      int64_t* shape;
      /*!
       * \brief strides of the tensor,
       *  can be NULL, indicating tensor is compact.
       */
      int64_t* strides;
      /*! \brief The offset in bytes to the beginning pointer to data */
      uint64_t byte_offset;
    } DLTensor;

(code from dlpack.h)

As you can see, there is a data pointer for the raw data as well as shape/stride/offset/GPU vs CPU, and other metadata about the data that the DLTensor is pointing to.

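There is also a managed version of the tensor that is called DLManagedTensor, where the frameworks can provide a context and also a "deleter" function that can be called by the framework that borrowed the Tensor to inform the other framework that the resources are no longer required.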
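To get a feeling for how little the DLTensor structure contains, it can be mirrored with ctypes and pointed at an existing buffer. This is only a sketch under some assumptions: the layouts of DLContext and DLDataType and the values of the kDLCPU and kDLFloat constants are written as I recall them from dlpack.h and are not shown in this post, and describe_floats(), read_floats() and dlpack_demo() are hypothetical helpers.

```python
import ctypes

K_DL_CPU = 1    # assumed value of kDLCPU in dlpack.h
K_DL_FLOAT = 2  # assumed value of kDLFloat in dlpack.h


class DLContext(ctypes.Structure):       # assumed layout, not shown in the post
    _fields_ = [("device_type", ctypes.c_int), ("device_id", ctypes.c_int)]


class DLDataType(ctypes.Structure):      # assumed layout, not shown in the post
    _fields_ = [("code", ctypes.c_uint8), ("bits", ctypes.c_uint8),
                ("lanes", ctypes.c_uint16)]


class DLTensor(ctypes.Structure):        # same field order as the struct above
    _fields_ = [("data", ctypes.c_void_p),
                ("ctx", DLContext),
                ("ndim", ctypes.c_int),
                ("dtype", DLDataType),
                ("shape", ctypes.POINTER(ctypes.c_int64)),
                ("strides", ctypes.POINTER(ctypes.c_int64)),
                ("byte_offset", ctypes.c_uint64)]


def describe_floats(buffer, shape):
    """Build a DLTensor that points at `buffer` without copying it."""
    tensor = DLTensor()
    tensor.data = ctypes.addressof(buffer)
    tensor.ctx = DLContext(K_DL_CPU, 0)
    tensor.ndim = len(shape)
    tensor.dtype = DLDataType(K_DL_FLOAT, 32, 1)
    tensor.shape = (ctypes.c_int64 * len(shape))(*shape)
    # strides is left NULL (compact tensor) and byte_offset is left at 0
    return tensor


def read_floats(tensor):
    """Read every element back through the metadata only."""
    count = 1
    for axis in range(tensor.ndim):
        count *= tensor.shape[axis]
    address = tensor.data + tensor.byte_offset
    return list((ctypes.c_float * count).from_address(address))


def dlpack_demo():
    buffer = (ctypes.c_float * 6)(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
    tensor = describe_floats(buffer, (2, 3))
    buffer[0] = 9.0                      # change the producer's data
    return read_floats(tensor), bool(tensor.strides)


print(dlpack_demo())
```

The consumer sees the change made by the producer because the structure only describes the memory; it never owns or copies it. Note that the caller has to keep the buffer alive, which is the gap that the DLManagedTensor deleter is meant to close.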

In PyTorch, if you want to convert to or from the DLTensor format, you can find C/C++ methods for doing that, and you can also do it in Python as shown below:

    import torch
    from torch.utils import dlpack

    t = torch.ones((5, 5))
    dl = dlpack.to_dlpack(t)


This Python function calls the toDLPack function from ATen, shown below:

    DLManagedTensor* toDLPack(const Tensor& src) {
      ATenDLMTensor * atDLMTensor(new ATenDLMTensor);
      atDLMTensor->handle = src;
      atDLMTensor->tensor.manager_ctx = atDLMTensor;
      atDLMTensor->tensor.deleter = &deleter;
      atDLMTensor->tensor.dl_tensor.data = src.data_ptr();
      int64_t device_id = 0;
      if (src.type().is_cuda()) {
        device_id = src.get_device();
      }
      atDLMTensor->tensor.dl_tensor.ctx = getDLContext(src.type(), device_id);
      atDLMTensor->tensor.dl_tensor.ndim = src.dim();
      atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src.type());
      atDLMTensor->tensor.dl_tensor.shape = const_cast<int64_t*>(src.sizes().data());
      atDLMTensor->tensor.dl_tensor.strides = const_cast<int64_t*>(src.strides().data());
      atDLMTensor->tensor.dl_tensor.byte_offset = 0;
      return &(atDLMTensor->tensor);
    }


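I really hope that more frameworks adopt this standard, which will certainly benefit the ecosystem. It is also interesting to note that a potential integration with Apache Arrow would be amazing.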


– Christian S. Perone

Cite this article as: Christian S. Perone, "PyTorch – Internal Architecture Tour," in Terra Incognita, 12/03/2018.

13 thoughts on "PyTorch – Internal Architecture Tour"

  1. Great post! Very interesting to see the details of Pytorch as well as to know that it is well-implemented.

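  2. Nice post! However, I think you'd better add the source code version, since the underlying backend is changing quickly and some of the links are already broken.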

  3. hi Christian, thanks for the insider details on pytorch.

    I have a problem converting from pytorch to numpy and was hoping you could help me understand whats happening and how to fix it.

    In simple terms, I convert an array to pytorch, do a process, then convert back to numpy for subsequent processing using opencv.

    torch_array = torch.from_numpy(numpy_array)# less than 1msec
    do processing on torch_array # less than 1 msec on GPU @ 99%
    numpy_array = np.array(torch_array) # greater than 200 msec

    GPU = NVIDIA on the Jetson TX1 platform
    torch = 0.4.0


    1. You should use .numpy().

      torch_array = torch.from_numpy(numpy_array)
      numpy_array = torch_array.numpy()

  4. Well Written! Now I know more about pytorch internals, how it represent/store tensors

  5. Thank you for a great post! It really helped me understand how the Tensor storage works. I could now check if two tensors share the same storage (by ` ==`), but how could I check if a numpy array is a view of a tensor? Is there a way to do a similar check between PyTorch and Numpy? Thank you for your advice in advance!

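  6. Great post, it really helped me understand PyTorch storage.

    My understanding is that I can create a C++ PyTorch tensor from an STL vector and expose it to Python through pybind without copies.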

    I wonder If I could expose a STL vector from C++ into Python and create a tensor from it without making copies, despite the fact that torch.tensor always copies data
