# PyTorch – Internal Architecture Tour

Update 28 Feb 2019: I added a new blog post with a slide deck containing the presentation I did for PyData Montreal.

## Short intro to Python extension objects in C/C++

```cpp
// Python object that backs torch.autograd.Variable
struct THPVariable {
    PyObject_HEAD
    torch::autograd::Variable cdata;
    PyObject* backward_hooks;
};
```

## Zero-copy PyTorch Tensor to Numpy and vice-versa

PyTorch has its own Tensor representation, which decouples the PyTorch internal representation from external representations. However, since it is very common, especially when data is loaded from a variety of sources, to have Numpy arrays everywhere, we really need to make conversions between Numpy and PyTorch tensors possible. For that reason, PyTorch provides two methods called `from_numpy()` and `numpy()`, which convert a Numpy array to a PyTorch tensor and vice-versa, respectively. If we look at the code that is called to convert a Numpy array into a PyTorch tensor, we can get more insights into PyTorch's internal representation:

```cpp
at::Tensor tensor_from_numpy(PyObject* obj) {
  if (!PyArray_Check(obj)) {
    throw TypeError("expected np.ndarray (got %s)", Py_TYPE(obj)->tp_name);
  }

  auto array = (PyArrayObject*)obj;
  int ndim = PyArray_NDIM(array);
  auto sizes = to_aten_shape(ndim, PyArray_DIMS(array));
  auto strides = to_aten_shape(ndim, PyArray_STRIDES(array));

  // NumPy strides use bytes. Torch strides use element counts.
  auto element_size_in_bytes = PyArray_ITEMSIZE(array);
  for (auto& stride : strides) {
    stride /= element_size_in_bytes;
  }

  // (...) - omitted for brevity

  void* data_ptr = PyArray_DATA(array);
  auto& type = CPU(dtype_to_aten(PyArray_TYPE(array)));
  Py_INCREF(obj);
  return type.tensorFromBlob(data_ptr, sizes, strides, [obj](void* data) {
    AutoGIL gil;
    Py_DECREF(obj);
  });
}
```

(Code from tensor_numpy.cpp)

After this, PyTorch will create a new Tensor object from this Numpy data blob. In the creation of this new Tensor, it passes the borrowed memory data pointer, together with the memory size and strides, as well as a function that will be used later by the Tensor Storage (we'll discuss this in the next section) to release the data by decrementing the reference count of the Numpy array object, letting Python take care of that object's life cycle.
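To see this borrowed-memory behavior from Python, here is a minimal sketch using only the public `torch.from_numpy()` API described above: a write through the Numpy array is visible through the tensor, since no copy was made.

```python
import numpy as np
import torch

np_array = np.ones((2, 2), dtype=np.float32)
torch_tensor = torch.from_numpy(np_array)  # no copy: borrows the Numpy buffer

# Mutating the Numpy array is visible through the tensor
np_array[0, 0] = 42.0
print(torch_tensor[0, 0].item())  # 42.0
```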

The `tensorFromBlob()` method will create a new Tensor, but only after creating a new "Storage" for this Tensor. The storage is where the actual data pointer will be stored (not in the Tensor structure itself). This takes us to the next section, about Tensor Storages.

## Tensor Storage

As we saw in the previous code from `tensor_from_numpy()`, there is a call to `tensorFromBlob()` that will create a Tensor from the raw data blob. This last function will call another function called `storageFromBlob()` which will, in turn, create a storage for this data according to its type. In the case of a CPU float type, it will return a new `CPUFloatStorage` instance.

```cpp
typedef struct THStorage
{
    real *data;
    ptrdiff_t size;
    int refcount;
    char flag;
    THAllocator *allocator;
    void *allocatorContext;
    struct THStorage *view;
} THStorage;
```

(Code from THStorage.h)

As you can see, the `THStorage` holds a pointer to the raw data, its size, flags and also an interesting field called `allocator` that we'll discuss soon. It is also important to note that there is no metadata inside the `THStorage` about how to interpret the data; this is due to the fact that the storage is "dumb" regarding its contents, and it is the Tensor's responsibility to know how to "view" or interpret this data.

```python
>>> tensor_a = torch.ones((3, 3))
>>> tensor_b = tensor_a.view(9)
>>> tensor_a.storage().data_ptr() == tensor_b.storage().data_ptr()
True
```

As we can see in the example above, the data pointer on the storage of both Tensors is the same, but the Tensors represent different interpretations of the storage data.
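A quick way to convince yourself of this sharing is to write through one view and read through the other; a small sketch:

```python
import torch

tensor_a = torch.ones((3, 3))
tensor_b = tensor_a.view(9)   # same storage, different size/strides

tensor_b[0] = 5.0             # write through the flat view...
print(tensor_a[0, 0].item())  # ...is visible through the 3x3 view: 5.0
```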

```cpp
typedef struct THAllocator
{
    void* (*malloc)(void*, ptrdiff_t);
    void* (*realloc)(void*, void*, ptrdiff_t);
    void (*free)(void*, void*);
} THAllocator;
```

(Code from THAllocator.h)

```cpp
static void *THCudaHostAllocator_malloc(void* ctx, ptrdiff_t size) {
  void* ptr;

  if (size < 0) THError("Invalid memory size: %ld", size);

  if (size == 0) return NULL;

  THCudaCheck(cudaMallocHost(&ptr, size));

  return ptr;
}
```

(Code from THCAllocator.c)

- **TH** = **T**orc**H**
- **THC** = **T**orc**H** **C**uda
- **THCS** = **T**orc**H** **C**uda **S**parse
- **THCUNN** = **T**orc**H** **C**uda **N**eural **N**etwork
- **THD** = **T**orc**H** **D**istributed
- **THNN** = **T**orc**H** **N**eural **N**etwork
- **THS** = **T**orc**H** **S**parse

This convention is also present in the function/class names and other objects, so it is important to always keep these patterns in mind. While you can find CPU allocators in the TH code, you'll find CUDA allocators in the THC code.

Finally, we can see the composition of the main Tensor `THTensor` structure:

```cpp
typedef struct THTensor
{
    int64_t *size;
    int64_t *stride;
    int nDimension;
    THStorage *storage;
    ptrdiff_t storageOffset;
    int refcount;
    char flag;
} THTensor;
```

(Code from THTensor.h)

And as you can see, the main `THTensor` structure holds the size/strides/dimensions/offsets/etc as well as the storage (`THStorage`) for the Tensor data.
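All of these fields are visible from Python; as a small sketch, slicing a tensor produces a new tensor with different sizes, strides and storage offset, but backed by the same storage:

```python
import torch

t = torch.zeros((3, 4))
print(t.size())            # torch.Size([3, 4])
print(t.stride())          # (4, 1) - in elements, not bytes
print(t.storage_offset())  # 0

s = t[1:, 2:]              # a slice: same storage, new metadata
print(s.storage_offset())  # 6 (= 1*4 + 2*1 elements into the storage)
print(s.stride())          # (4, 1)
```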

Now, once we have requirements such as multiprocessing, where we want to share tensor data among multiple different processes, we need a shared memory approach to solve it. Otherwise, every time another process needs a tensor, or even when you want to implement a Hogwild training procedure where all the different processes write to the same memory region (where the parameters are), you'd need to make copies between processes, and this is very inefficient. Therefore, we'll discuss in the next section a special kind of storage for Shared Memory.

## Shared Memory

PyTorch provides wrappers around the Python `multiprocessing` module, which can be imported from `torch.multiprocessing`. The changes implemented in this wrapper around the official Python multiprocessing module ensure that every time a tensor is put on a queue or shared with another process, PyTorch will make sure that only a handle to the shared memory will be shared, instead of a new entire copy of the Tensor.
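As a minimal sketch of this behavior (assuming the default fork start method on Linux), a child process that receives a shared tensor writes into the very same memory region seen by the parent:

```python
import torch
import torch.multiprocessing as mp

def worker(t):
    # The child receives only a handle to the shared memory region,
    # so this in-place write is visible in the parent process.
    t.add_(1.0)

tensor = torch.zeros(4)
tensor.share_memory_()   # moves the storage to shared memory
assert tensor.is_shared()

if __name__ == "__main__":
    p = mp.Process(target=worker, args=(tensor,))
    p.start()
    p.join()
    print(tensor)        # tensor([1., 1., 1., 1.])
```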

```cpp
static THStorage* THPStorage_(newFilenameStorage)(ptrdiff_t size)
{
  int flags = TH_ALLOCATOR_MAPPED_SHAREDMEM | TH_ALLOCATOR_MAPPED_EXCLUSIVE;
  std::string handle = THPStorage_(__newHandle)();
  auto ctx = libshm_context_new(NULL, handle.c_str(), flags);
  return THStorage_(newWithAllocator)(size, &THManagedSharedAllocator, (void*)ctx);
}
```

(Code from StorageSharing.cpp)

And as you can see, this function will create another storage using a special allocator called `THManagedSharedAllocator`. This function first defines some flags, then creates a handle, which is a string in the format `/torch_[process id]_[random number]`, and after that it will create a new storage using the special `THManagedSharedAllocator`. This allocator has function pointers to an internal PyTorch library called libshm, which will implement a Unix Domain Socket communication to share the handles of the shared memory regions. This allocator is a special case: it is a kind of "smart allocator" because it contains the communication control logic, and it uses another allocator called `THRefcountedMapAllocator` that will be responsible for creating the actual shared memory region and calling `mmap()` to map this region to the process virtual address space.

```python
>>> import torch
>>> tensor_a = torch.ones((5, 5))
>>> tensor_a

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

>>> tensor_a.is_shared()
False
>>> tensor_a = tensor_a.share_memory_()
>>> tensor_a.is_shared()
True
>>> tensor_a_storage = tensor_a.storage()
>>> tensor_a_storage._share_filename_()
(b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)
```

In this code, executed in process A, we create a new 5×5 Tensor filled with ones. After that we make it shared and print the tuple with the Unix Domain Socket address as well as the handle. Now we can access this memory region from another process B, as shown below:

```python
>>> import torch
>>> tensor_a = torch.Tensor()
>>> tuple_info = (b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)
>>> storage = torch.Storage._new_shared_filename(*tuple_info)
>>> tensor_a = torch.Tensor(storage).view((5, 5))

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]
```

## DLPack: a hope for the Deep Learning frameworks Babel

This will certainly help to overcome the "island model" that we have today between tensor representations in MXNet, PyTorch, etc, and will allow developers to mix framework operations between frameworks and all the benefits that a standardization can bring to the frameworks.

The core of DLPack is a very simple structure called `DLTensor`, as shown below:

```cpp
/*!
 * \brief Plain C Tensor object, does not manage memory.
 */
typedef struct {
  /*!
   * \brief The opaque data pointer points to the allocated data.
   * This will be CUDA device pointer or cl_mem handle in OpenCL.
   * This pointer is always aligns to 256 bytes as in CUDA.
   */
  void* data;
  /*! \brief The device context of the tensor */
  DLContext ctx;
  /*! \brief Number of dimensions */
  int ndim;
  /*! \brief The data type of the pointer*/
  DLDataType dtype;
  /*! \brief The shape of the tensor */
  int64_t* shape;
  /*!
   * \brief strides of the tensor,
   * can be NULL, indicating tensor is compact.
   */
  int64_t* strides;
  /*! \brief The offset in bytes to the beginning pointer to data */
  uint64_t byte_offset;
} DLTensor;
```

(Code from dlpack.h)

As you can see, there is a data pointer for the raw data as well as shape/stride/offset/GPU vs CPU, and other metadata about the data that the `DLTensor` is pointing to.

In PyTorch, if you want to convert to or from a DLTensor format, you can find both C/C++ methods for doing that, or you can do it directly in Python, as shown below:

```python
import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
dl = dlpack.to_dlpack(t)
```

This Python function will call the `toDLPack` function from ATen, shown below:

```cpp
DLManagedTensor* toDLPack(const Tensor& src) {
  ATenDLMTensor* atDLMTensor(new ATenDLMTensor);
  atDLMTensor->handle = src;
  atDLMTensor->tensor.manager_ctx = atDLMTensor;
  atDLMTensor->tensor.deleter = &deleter;
  atDLMTensor->tensor.dl_tensor.data = src.data_ptr();

  int64_t device_id = 0;
  if (src.type().is_cuda()) {
    device_id = src.get_device();
  }

  atDLMTensor->tensor.dl_tensor.ctx = getDLContext(src.type(), device_id);
  atDLMTensor->tensor.dl_tensor.ndim = src.dim();
  atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src.type());
  atDLMTensor->tensor.dl_tensor.shape = const_cast<int64_t*>(src.sizes().data());
  atDLMTensor->tensor.dl_tensor.strides = const_cast<int64_t*>(src.strides().data());
  atDLMTensor->tensor.dl_tensor.byte_offset = 0;

  return &(atDLMTensor->tensor);
}
```

As you can see, it’s a pretty simple conversion, casting the metadata from the PyTorch format to the DLPack format and assigning a pointer to the internal Tensor data representation.
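To complete the picture, here is a small sketch of the round trip from Python: converting back with `dlpack.from_dlpack()` yields a tensor that shares the same underlying memory (note that a DLPack capsule can only be consumed once):

```python
import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
capsule = dlpack.to_dlpack(t)     # a PyCapsule wrapping a DLManagedTensor
t2 = dlpack.from_dlpack(capsule)  # new tensor, same underlying memory

t2[0, 0] = 42.0                   # zero-copy: the write is visible in both
print(t[0, 0].item())             # 42.0
```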

I really hope that more frameworks adopt this standard, which will certainly bring benefits to the ecosystem. It is also interesting to note that a potential integration with Apache Arrow would be amazing.

That's it, I hope you liked this long post!

– Christian S. Perone

Cite this article as: Christian S. Perone, "PyTorch – Internal Architecture Tour," in Terra Incognita, 12/03/2018, //www.cpetem.com/2018/03/pytorch-internal-architecture-tour/

## 13 thoughts to “PyTorch – Internal Architecture Tour”

1. Thomas says:

Great post! Very interesting to see the details of PyTorch, and good to know it is well implemented.

2. Anonymous says:

This is awesome!

3. Ho Chi Minh says:

Awesome write-up! This inspired me to look more at the PyTorch codebase.

4. Anonymous says:

Great post, thanks!

5. Anonymous says:

Nice post! However, I think you'd better add the source code version, since the underlying backend is changing fast and some links are already broken.

6. H says:

Hi Christian, thanks for the insider details on pytorch.

I have a question about the conversion from pytorch to numpy, and hope you can help me understand what is happening and how to fix it.

Simply put, I convert an array to pytorch, perform a process, then convert back to numpy for subsequent processing using OpenCV.

Example:
torch_array = torch.from_numpy(numpy_array) # less than 1 ms
do processing on torch_array # less than 1 ms, GPU @ 99%
numpy_array = np.array(torch_array) # greater than 200 ms

GPU = nvidia on jetson TX1 platform
torch = 0.4.0

regards h

1. Vladislav Kurenkov says:

You should use .numpy():

torch_array = torch.from_numpy(numpy_array)
….
….
numpy_array = torch_array.numpy()

7. Eric says:

Well written! Now I know more about PyTorch internals and how it represents/stores tensors.

8. dave Kielpinski says:

I definitely appreciate this blog post. Thank you!

9. cocoaaa says:

Thank you for a great post! It really helped me understand how tensor storage works. Now I can check whether two tensors share the same storage (by `t0.storage().data_ptr() == t1.storage().data_ptr()`), but how can I check whether a numpy array is a view of a tensor? Is there a way to do a similar check between PyTorch and numpy? Thanks in advance for your advice!

1. You can use `n_array.__array_interface__['data']`; however, this is just for illustration purposes, since comparing raw pointers is not a great idea.

10. Thiago says:

Great post, it did help me a lot understanding Pytorch storage.

My understanding is that I can create a C++ pytorch tensor from a STL vector and expose it to Python through pybind without copies.

I wonder if I could expose a STL vector from C++ into Python and create a tensor from it without making copies, despite the fact that https://pytorch.org/docs/stable/tensors.html says torch.tensor always copies data.