# PyTorch - Internal Architecture Tour

## Introduction

This post is a tour around the PyTorch codebase. It is meant to be a guide to the architectural design of PyTorch and its internals. My main goal is to provide something useful for those who are interested in understanding what happens beyond the user-facing API and to show something new beyond what was already covered in other tutorials.

Note: the PyTorch build system uses code generation extensively, so I won’t repeat here what was already described by others. If you’re interested in understanding how this works, please read the existing tutorials on the topic.

## Short intro to Python extension objects in C/C++

    // Python object that backs torch.autograd.Variable
    struct THPVariable {
        PyObject_HEAD
        torch::autograd::Variable cdata;
        PyObject* backward_hooks;
    };

Funny fact: it is very common in many applications to use small integer numbers for indexing, counters, etc. For efficiency, the official CPython implementation caches the integers from -5 up to 256.
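The caching can be observed from Python itself. The snippet below is a minimal sketch that assumes the CPython interpreter; other implementations may behave differently. The values are built with `int("...")` at run time so that the compiler cannot merge identical constants.

```python
# CPython-specific behaviour: integers in [-5, 256] are preallocated and reused.
a = int("256")
b = int("256")
print(a is b)  # both names refer to the same cached object on CPython

# Outside the cached range each call builds a separate object on CPython,
# although the values still compare equal.
c = int("257")
d = int("257")
print(c is d)
print(c == d)
```

This is also why `is` should never be used to compare numbers: identity depends on an interpreter optimization, while `==` compares values.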

## Zero-copy PyTorch Tensor to Numpy and vice-versa

PyTorch has its own Tensor representation, which decouples the PyTorch internal representation from external representations. However, since it is very common to have Numpy arrays everywhere, especially when data is loaded from a variety of sources, we really need to make conversions between Numpy arrays and PyTorch tensors. For that reason, PyTorch provides two methods called `from_numpy()` and `numpy()`, which convert a Numpy array to a PyTorch tensor and vice-versa, respectively. If we look at the code that is called to convert a Numpy array into a PyTorch tensor, we can get more insight into PyTorch’s internal representation:

```
at::Tensor tensor_from_numpy(PyObject* obj) {
  if (!PyArray_Check(obj)) {
    throw TypeError("expected np.ndarray (got %s)", Py_TYPE(obj)->tp_name);
  }

  auto array = (PyArrayObject*)obj;
  int ndim = PyArray_NDIM(array);
  auto sizes = to_aten_shape(ndim, PyArray_DIMS(array));
  auto strides = to_aten_shape(ndim, PyArray_STRIDES(array));

  // NumPy strides use bytes. Torch strides use element counts.
  auto element_size_in_bytes = PyArray_ITEMSIZE(array);
  for (auto& stride : strides) {
    stride /= element_size_in_bytes;
  }

  // (...) - omitted for brevity

  void* data_ptr = PyArray_DATA(array);
  auto& type = CPU(dtype_to_aten(PyArray_TYPE(array)));
  Py_INCREF(obj);
  return type.tensorFromBlob(data_ptr, sizes, strides, [obj](void* data) {
    AutoGIL gil;
    Py_DECREF(obj);
  });
}
```

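(code from tensor_numpy.cpp)

The `tensorFromBlob()` method will create a new Tensor, but only after creating a new “Storage” for this Tensor. The storage is where the actual data pointer is stored (and not in the Tensor structure itself). This takes us to the next section, about Tensor Storages.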
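The stride conversion in the loop above can be sketched in plain Python. The function name `byte_strides_to_element_strides` is hypothetical and is not part of PyTorch or NumPy; it only mirrors the division by the item size done in the C++ code.

```python
def byte_strides_to_element_strides(byte_strides, itemsize):
    """Convert NumPy-style strides (bytes) to Torch-style strides (elements)."""
    if itemsize <= 0:
        raise ValueError("itemsize must be positive")
    element_strides = []
    for stride in byte_strides:
        # A stride that is not a multiple of the item size cannot be
        # expressed as a whole number of elements.
        if stride % itemsize != 0:
            raise ValueError("stride is not a multiple of the item size")
        element_strides.append(stride // itemsize)
    return tuple(element_strides)


# A C-contiguous 3x4 array of 4-byte floats has byte strides (16, 4).
print(byte_strides_to_element_strides((16, 4), 4))  # (4, 1)
```

Only the stride metadata is rewritten; the data pointer is reused as-is, which is what makes the conversion zero-copy.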

## Tensor Storage

The CPUFloatStorage is basically a wrapper with utility functions around the actual storage structure called `THFloatStorage`, which we show below:

    typedef struct THStorage
    {
        real *data;
        ptrdiff_t size;
        int refcount;
        char flag;
        THAllocator *allocator;
        void *allocatorContext;
        struct THStorage *view;
    } THStorage;

(code from THStorage.h)

As you can see, the `THStorage` holds a pointer to the raw data, its size, flags and also an interesting field called `allocator` that we’ll discuss soon. It is also important to note that there is no metadata about how to interpret the data inside the `THStorage`. This is because the storage is “dumb” regarding its contents, and it is the Tensor’s responsibility to know how to “view” or interpret this data. For example, two tensors with different shapes can share the same storage:

    >>> import torch
    >>> tensor_a = torch.ones((3, 3))
    >>> tensor_b = tensor_a.view(9)
    >>> tensor_a.storage().data_ptr() == tensor_b.storage().data_ptr()
    True

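Now, as we saw in the `allocator` field of the `THFloatStorage` structure, there is a pointer to a `THAllocator` structure there. This is very important because it brings flexibility regarding the allocator that can be used to allocate the storage data. This structure is represented by the following code:

    typedef struct THAllocator
    {
        void* (*malloc)(void*, ptrdiff_t);
        void* (*realloc)(void*, void*, ptrdiff_t);
        void (*free)(void*, void*);
    } THAllocator;

(code from THAllocator.h)

An allocator for CUDA host memory, for example, implements `malloc` through `cudaMallocHost()`:

    static void *THCudaHostAllocator_malloc(void* ctx, ptrdiff_t size) {
      void* ptr;
      if (size < 0) THError("Invalid memory size: %ld", size);
      if (size == 0) return NULL;
      THCudaCheck(cudaMallocHost(&ptr, size));
      return ptr;
    }

(code from THCAllocator.c)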
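To illustrate why a table of function pointers gives this flexibility, here is a small Python sketch. All names in it (`Allocator`, `Storage`, `counting_malloc`, `counting_free`) are hypothetical and only mimic the idea of `THAllocator`; `realloc` is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Allocator:
    """A table of callables, loosely mirroring the THAllocator struct."""
    malloc: Callable[[object, int], bytearray]
    free: Callable[[object, bytearray], None]


class Storage:
    """Owns a raw buffer and delegates allocation to the given allocator."""

    def __init__(self, size, allocator, ctx=None):
        self.allocator = allocator
        self.ctx = ctx
        self.data = allocator.malloc(ctx, size)

    def release(self):
        self.allocator.free(self.ctx, self.data)
        self.data = None


def counting_malloc(ctx, size):
    # The context object carries allocator-specific state, here a byte counter.
    if size < 0:
        raise ValueError("invalid memory size")
    ctx["allocated"] += size
    return bytearray(size)


def counting_free(ctx, buffer):
    ctx["allocated"] -= len(buffer)


COUNTING_ALLOCATOR = Allocator(malloc=counting_malloc, free=counting_free)

stats = {"allocated": 0}
storage = Storage(16, COUNTING_ALLOCATOR, ctx=stats)
print(stats["allocated"])  # 16
storage.release()
print(stats["allocated"])  # 0
```

The storage code never needs to know where the memory comes from. Swapping the allocator is enough to change the allocation strategy.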

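The prefixes used by the libraries in the repository follow these conventions:

• TH = TorcH
• THC = TorcH Cuda
• THCS = TorcH Cuda Sparse
• THCUNN = TorcH CUda Neural Network
• THD = TorcH Distributed
• THNN = TorcH Neural Network
• THS = TorcH Sparse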

This convention is also present in the function/class names and other objects, so it is important to always keep these patterns in mind. While you can find CPU allocators in the TH code, you’ll find CUDA allocators in the THC code.

    typedef struct THTensor
    {
        int64_t *size;
        int64_t *stride;
        int nDimension;
        THStorage *storage;
        ptrdiff_t storageOffset;
        int refcount;
        char flag;
    } THTensor;

(Code from THTensor.h)

And as you can see, the main `THTensor` structure holds the size/strides/dimensions/offsets/etc as well as the storage (`THStorage`) for the Tensor data.

Now, once we have requirements such as multi-processing, where we want to share tensor data among multiple different processes, we need a shared memory approach. Otherwise, every time another process needs a tensor, or when you want to implement a Hogwild training procedure where all the different processes write to the same memory region (where the parameters are), you’ll need to make copies between processes, and this is very inefficient. Therefore, the next section discusses a special kind of storage for Shared Memory.
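The split between a “dumb” storage and the tensor metadata can be sketched in plain Python. The `StridedView` class below is hypothetical and much simpler than `THTensor`; it only shows how sizes, strides and an offset turn a multi-dimensional index into a position in a flat buffer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StridedView:
    storage: List[float]       # flat buffer shared between views
    sizes: Tuple[int, ...]     # extent of each dimension
    strides: Tuple[int, ...]   # step in elements for each dimension
    offset: int = 0            # index of the first element in the storage

    def _flat_index(self, index):
        if isinstance(index, int):
            index = (index,)
        if len(index) != len(self.sizes):
            raise IndexError("wrong number of indices")
        flat = self.offset
        for i, size, stride in zip(index, self.sizes, self.strides):
            if not 0 <= i < size:
                raise IndexError("index out of range")
            flat += i * stride
        return flat

    def __getitem__(self, index):
        return self.storage[self._flat_index(index)]

    def __setitem__(self, index, value):
        self.storage[self._flat_index(index)] = value


storage = [float(i) for i in range(9)]
matrix = StridedView(storage, sizes=(3, 3), strides=(3, 1))
flat = StridedView(storage, sizes=(9,), strides=(1,))
transposed = StridedView(storage, sizes=(3, 3), strides=(1, 3))

matrix[1, 2] = 42.0      # writes storage[1 * 3 + 2]
print(flat[5])           # 42.0, same storage seen as a vector
print(transposed[2, 1])  # 42.0, same storage seen with swapped strides
```

All three views read the same element because none of them owns the data; they only carry different metadata about how to interpret it.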

## Shared Memory

PyTorch provides a wrapper around the Python `multiprocessing` module, which can be imported from `torch.multiprocessing`. The changes implemented in this wrapper around the official Python multiprocessing module make sure that every time a tensor is put on a queue or shared with another process, only a handle to the shared memory is shared instead of an entire new copy of the Tensor.

Now, many people aren’t aware of a Tensor method from PyTorch called `share_memory_()`. However, this function is what triggers an entire rebuild of the storage memory for that particular Tensor. What this method does is create a region of shared memory that can be used among different processes. In the end, this function will call the following function:

    static THStorage* THPStorage_(newFilenameStorage)(ptrdiff_t size)
    {
      int flags = TH_ALLOCATOR_MAPPED_SHAREDMEM | TH_ALLOCATOR_MAPPED_EXCLUSIVE;
      std::string handle = THPStorage_(__newHandle)();
      auto ctx = libshm_context_new(NULL, handle.c_str(), flags);
      return THStorage_(newWithAllocator)(size, &THManagedSharedAllocator, (void*)ctx);
    }

(Code from StorageSharing.cpp)

Note: when a method ends with an underscore in PyTorch, such as the method called `share_memory_()`, it means that this method has an in-place effect, and it will change the current object instead of creating a new one with the modifications.

I’ll now show a Python example of one process using the data from a Tensor that was allocated in another process, by manually exchanging the shared memory handle:

This is executed in process A:

    >>> import torch
    >>> tensor_a = torch.ones((5, 5))
    >>> tensor_a
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
    [torch.FloatTensor of size 5x5]
    >>> tensor_a.is_shared()
    False
    >>> tensor_a = tensor_a.share_memory_()
    >>> tensor_a.is_shared()
    True
    >>> tensor_a_storage = tensor_a.storage()
    >>> tensor_a_storage._share_filename_()
    (b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)

In this code, executed in process A, we create a new 5×5 Tensor filled with ones. After that, we make it shared and print the tuple with the Unix Domain Socket address as well as the handle. Now we can access this memory region from another process B, as shown below:

    >>> import torch
    >>> tensor_a = torch.Tensor()
    >>> tuple_info = (b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)
    >>> storage = torch.Storage._new_shared_filename(*tuple_info)
    >>> tensor_a = torch.Tensor(storage).view((5, 5))
    >>> tensor_a
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
     1  1  1  1  1
    [torch.FloatTensor of size 5x5]
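The same idea of exchanging only a handle can be tried without PyTorch, using `multiprocessing.shared_memory` from the Python standard library (Python 3.8 or later). This is a sketch of the general technique, not of PyTorch’s implementation. To keep it short, both handles are opened in a single process; the second one could equally be opened in another process that received the name.

```python
import struct
from multiprocessing import shared_memory

COUNT = 25                    # 5x5 values, as in the example above
FMT = f"{COUNT}f"             # 25 single-precision floats
size = struct.calcsize(FMT)

# "Process A": allocate the shared region and fill it with ones.
owner = shared_memory.SharedMemory(create=True, size=size)
try:
    struct.pack_into(FMT, owner.buf, 0, *([1.0] * COUNT))
    handle = owner.name       # a plain string, cheap to send to another process

    # "Process B": attach to the same region using only the handle.
    peer = shared_memory.SharedMemory(name=handle)
    try:
        values = struct.unpack_from(FMT, peer.buf, 0)
        # A write through one handle is visible through the other one.
        struct.pack_into("f", peer.buf, 0, 7.0)
        first_seen_by_owner = struct.unpack_from("f", owner.buf, 0)[0]
    finally:
        peer.close()
finally:
    owner.close()
    owner.unlink()            # release the shared region

print(values[:5])
print(first_seen_by_owner)
```

No tensor data is copied between the two handles: only the name of the region travels, and both sides map the same memory.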

## DLPack: a hope for the Deep Learning frameworks Babel

DLPack is an open in-memory tensor structure for exchanging tensor data between frameworks. It will certainly help to overcome the “island model” that we have today between tensor representations in MXNet, PyTorch, etc. It will also allow developers to mix operations from different frameworks and bring all the benefits that standardization can offer.

The core of DLPack is a very simple structure called `DLTensor`, as shown below:

    /*!
     * \brief Plain C Tensor object, does not manage memory.
     */
    typedef struct {
      /*!
       * \brief The opaque data pointer points to the allocated data.
       *  This will be CUDA device pointer or cl_mem handle in OpenCL.
       *  This pointer is always aligns to 256 bytes as in CUDA.
       */
      void* data;
      /*! \brief The device context of the tensor */
      DLContext ctx;
      /*! \brief Number of dimensions */
      int ndim;
      /*! \brief The data type of the pointer*/
      DLDataType dtype;
      /*! \brief The shape of the tensor */
      int64_t* shape;
      /*!
       * \brief strides of the tensor,
       *  can be NULL, indicating tensor is compact.
       */
      int64_t* strides;
      /*! \brief The offset in bytes to the beginning pointer to data */
      uint64_t byte_offset;
    } DLTensor;

(code from dlpack.h)

There is also a managed version of the tensor, called `DLManagedTensor`, where the frameworks can provide a context and also a “deleter” function. The framework that borrowed the Tensor can call this function to inform the other framework that the resources are no longer required.

In Python, a PyTorch Tensor can be converted to the DLPack format as follows:

    import torch
    from torch.utils import dlpack

    t = torch.ones((5, 5))
    dl = dlpack.to_dlpack(t)

This Python function will call the `toDLPack` function from ATen, shown below:

    DLManagedTensor* toDLPack(const Tensor& src) {
      ATenDLMTensor * atDLMTensor(new ATenDLMTensor);
      atDLMTensor->handle = src;
      atDLMTensor->tensor.manager_ctx = atDLMTensor;
      atDLMTensor->tensor.deleter = &deleter;
      atDLMTensor->tensor.dl_tensor.data = src.data_ptr();
      int64_t device_id = 0;
      if (src.type().is_cuda()) {
        device_id = src.get_device();
      }
      atDLMTensor->tensor.dl_tensor.ctx = getDLContext(src.type(), device_id);
      atDLMTensor->tensor.dl_tensor.ndim = src.dim();
      atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src.type());
      atDLMTensor->tensor.dl_tensor.shape = const_cast<int64_t*>(src.sizes().data());
      atDLMTensor->tensor.dl_tensor.strides = const_cast<int64_t*>(src.strides().data());
      atDLMTensor->tensor.dl_tensor.byte_offset = 0;
      return &(atDLMTensor->tensor);
    }
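The borrow-and-release protocol can be sketched in a few lines of Python. Everything below (`ManagedTensor`, `ProducerFramework`, `consume`) is hypothetical and only illustrates the role of the context and the deleter; it does not reproduce the real `DLManagedTensor` layout.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ManagedTensor:
    data: List[float]                 # borrowed buffer, not copied
    shape: Tuple[int, ...]
    manager_ctx: object               # opaque context owned by the producer
    deleter: Optional[Callable[["ManagedTensor"], None]]


class ProducerFramework:
    """Keeps exported buffers alive until the consumer calls the deleter."""

    def __init__(self):
        self.live_exports = {}

    def export(self, values, shape):
        buffer = list(values)
        managed = ManagedTensor(buffer, shape, manager_ctx=self, deleter=None)
        self.live_exports[id(managed)] = buffer
        managed.deleter = self._release
        return managed

    def _release(self, managed):
        # The consumer is done, so the producer may free its resources.
        self.live_exports.pop(id(managed), None)


def consume(managed):
    """A consumer in another framework: use the data, then call the deleter."""
    total = sum(managed.data)
    if managed.deleter is not None:
        managed.deleter(managed)
    return total


producer = ProducerFramework()
managed = producer.export([1.0] * 4, (2, 2))
print(len(producer.live_exports))  # 1, the buffer is kept alive
print(consume(managed))            # 4.0
print(len(producer.live_exports))  # 0, released through the deleter
```

The producer stays the owner of the memory for the whole exchange. The consumer only signals when it has finished with it.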

As you can see, it’s a pretty simple conversion, casting the metadata from the PyTorch format to the DLPack format and assigning a pointer to the internal Tensor data representation.

That’s it, I hope you liked this long post!

- Christian S. Perone

