Gandiva, using LLVM and Arrow to JIT and evaluate Pandas expressions

Introduction

This is the first post of 2020, so Happy New Year to you all!

I’ve been a huge fan of LLVM since 11 years ago, when I started playing with it to JIT data structures such as AVL trees, then later to JIT restricted AST trees and to JIT native code from TensorFlow graphs. Since then, LLVM has evolved into one of the most important compiler framework ecosystems and is used nowadays by a lot of important open-source projects.

One cool project that I recently became aware of is Gandiva. Gandiva was developed by Dremio and later donated to Apache Arrow (kudos to the Dremio team for that). The main idea of Gandiva is that it provides a compiler to generate LLVM IR that can operate on batches of Apache Arrow data. Gandiva was written in C++ and comes with a lot of different functions implemented to build an expression tree that can be JIT'ed using LLVM. One nice feature of this design is that it can use LLVM to automatically optimize complex expressions, add native target-platform vectorization such as AVX while operating on Arrow batches, and execute native code to evaluate the expressions.

The image below gives an overview of Gandiva:

An overview of how Gandiva works. Image from: https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow

In this post, I'll build a very simple expression parser supporting a limited set of operations, and I'll use it to filter a Pandas DataFrame.

Building a simple expression with Gandiva

In this section, I'll show how to manually create a simple expression using the Gandiva tree builder.

Using the Gandiva Python bindings to JIT an expression

Before building our parser and expression builder for expressions, let’s manually build a simple expression with Gandiva. First, we will create a simple Pandas DataFrame with numbers from 0.0 to 9.0:

import pandas as pd
import pyarrow as pa
import pyarrow.gandiva as gandiva

# Create a simple Pandas DataFrame
df = pd.DataFrame({"x": [1.0 * i for i in range(10)]})
table = pa.Table.from_pandas(df)
schema = pa.Schema.from_pandas(df)

We converted the DataFrame to an Arrow Table; it is important to note that in this case it was a zero-copy operation, Arrow isn't copying data from Pandas and duplicating the DataFrame. Later we get the schema from the table, which contains the column types and other metadata.

After that, we want to use Gandiva to build the following expression to filter the data:

(x > 2.0) and (x < 6.0)

This expression will be built using nodes from Gandiva:

builder = gandiva.TreeExprBuilder()

# Reference the column "x"
node_x = builder.make_field(table.schema.field("x"))

# Make two literals: 2.0 and 6.0
two = builder.make_literal(2.0, pa.float64())
six = builder.make_literal(6.0, pa.float64())

# Create a function for "x > 2.0"
gt_five_node = builder.make_function("greater_than",
                                     [node_x, two],
                                     pa.bool_())

# Create a function for "x < 6.0"
lt_ten_node = builder.make_function("less_than",
                                    [node_x, six],
                                    pa.bool_())

# Create an "and" node for "(x > 2.0) and (x < 6.0)"
and_node = builder.make_and([gt_five_node, lt_ten_node])

# Make the expression a condition and create a filter
condition = builder.make_condition(and_node)
filter_ = gandiva.make_filter(table.schema, condition)

This code now looks a little more complex but it is easy to understand. We are basically creating the nodes of a tree that will represent the expression we showed earlier. Here is a graphical representation of what it looks like:
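To actually run this hand-built filter on the table, we can evaluate it on a record batch. This is a minimal sketch using the same evaluate() call that we'll use later in the visitor class:

# Evaluate the JIT'ed filter on the first record batch of the table;
# the result is a selection vector with the indexes of the matching rows.
result = filter_.evaluate(table.to_batches()[0], pa.default_memory_pool())
print(result.to_array().to_numpy())  # e.g. [3 4 5], the rows where 2.0 < x < 6.0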

Inspecting the generated LLVM IR

Unfortunately, I haven't found a way to dump the generated LLVM IR using the Arrow Python bindings; however, we can just use the C++ API to build the same tree and then look at the generated LLVM IR:

auto field_x = field("x", float32());
auto schema = arrow::schema({field_x});

auto node_x = TreeExprBuilder::MakeField(field_x);

auto two = TreeExprBuilder::MakeLiteral((float_t)2.0);
auto six = TreeExprBuilder::MakeLiteral((float_t)6.0);

auto gt_five_node = TreeExprBuilder::MakeFunction("greater_than",
                                                  {node_x, two},
                                                  arrow::boolean());
auto lt_ten_node = TreeExprBuilder::MakeFunction("less_than",
                                                 {node_x, six},
                                                 arrow::boolean());

auto and_node = TreeExprBuilder::MakeAnd({gt_five_node, lt_ten_node});
auto condition = TreeExprBuilder::MakeCondition(and_node);

std::shared_ptr<Filter> filter;
auto status = Filter::Make(schema, condition, TestConfiguration(), &filter);

The code above is the same as the Python code, but using the C++ Gandiva API. Now that we built the tree in C++, we can get the LLVM Module and dump the IR code for it. The generated IR is full of boilerplate code and the JIT'ed functions from the Gandiva registry; however, the important parts are shown below:

; Function Attrs: alwaysinline norecurse nounwind readnone ssp uwtable
define internal zeroext i1 @less_than_float32_float32(float, float) local_unnamed_addr #0 {
  %3 = fcmp olt float %0, %1
  ret i1 %3
}

; Function Attrs: alwaysinline norecurse nounwind readnone ssp uwtable
define internal zeroext i1 @greater_than_float32_float32(float, float) local_unnamed_addr #0 {
  %3 = fcmp ogt float %0, %1
  ret i1 %3
}

(...)

%x = load float, float* %11
%greater_than_float32_float32 = call i1 @greater_than_float32_float32(float %x, float 2.000000e+00)

(...)

%x11 = load float, float* %15
%less_than_float32_float32 = call i1 @less_than_float32_float32(float %x11, float 6.000000e+00)

As you can see, in the IR we can see calls to the functions less_than_float32_float32 and greater_than_float32_float32, which are the (in this case very simple) Gandiva functions that do the float comparisons. Note also the specialization of the functions, visible in the function names.

What is quite interesting is that LLVM will apply all of its optimizations on this code and will emit efficient native code for the target platform, while Gandiva and LLVM take care of making sure that memory alignment is correct so that extensions such as AVX can be used for vectorization.

Note that the IR code I showed isn't the one that is actually executed; the optimized one is. And in the optimized version we can see that LLVM inlined the functions, as shown in the part of the optimized code below:

%x.us = load float, float* %10, align 4
%11 = fcmp ogt float %x.us, 2.000000e+00
%12 = fcmp olt float %x.us, 6.000000e+00
%not.or.cond = and i1 %12, %11

As you can see, the expression is now much simpler after optimization, since LLVM applied its powerful optimizations and inlined a lot of the Gandiva functions.

Building a Pandas filter expression JIT with Gandiva

Now we want to be able to implement something similar to the Pandas DataFrame.query() function using Gandiva. The first problem we will face is that we need to parse a string such as (x > 2.0) and (x < 6.0); later we will have to build the Gandiva expression tree using the Gandiva tree builder and then evaluate that expression on the Arrow data.

Now, instead of implementing a full parser for the expression string, I'll use the Python AST module to parse valid Python code and build an Abstract Syntax Tree (AST) of that expression, which I'll later use to emit the Gandiva/LLVM nodes.

The heavy lifting of parsing the string will be delegated to the Python AST module, and our work will be mostly walking this tree and emitting the Gandiva nodes based on the syntax tree. The code for visiting the nodes of this Python AST tree and emitting Gandiva nodes is shown below:

class LLVMGandivaVisitor(ast.NodeVisitor):
    def __init__(self, df_table):
        self.table = df_table
        self.builder = gandiva.TreeExprBuilder()
        self.columns = {f.name: self.builder.make_field(f)
                        for f in self.table.schema}
        self.compare_ops = {
            "Gt": "greater_than",
            "Lt": "less_than",
        }
        self.bin_ops = {
            "BitAnd": self.builder.make_and,
            "BitOr": self.builder.make_or,
        }

    def visit_Module(self, node):
        return self.visit(node.body[0])

    def visit_BinOp(self, node):
        left = self.visit(node.left)
        right = self.visit(node.right)
        op_name = node.op.__class__.__name__
        gandiva_bin_op = self.bin_ops[op_name]
        return gandiva_bin_op([left, right])

    def visit_Compare(self, node):
        op = node.ops[0]
        op_name = op.__class__.__name__
        gandiva_comp_op = self.compare_ops[op_name]
        comparators = self.visit(node.comparators[0])
        left = self.visit(node.left)
        return self.builder.make_function(gandiva_comp_op,
                                          [left, comparators], pa.bool_())

    def visit_Num(self, node):
        return self.builder.make_literal(node.n, pa.float64())

    def visit_Expr(self, node):
        return self.visit(node.value)

    def visit_Name(self, node):
        return self.columns[node.id]

    def generic_visit(self, node):
        return node

    def evaluate_filter(self, llvm_mod):
        condition = self.builder.make_condition(llvm_mod)
        filter_ = gandiva.make_filter(self.table.schema, condition)
        result = filter_.evaluate(self.table.to_batches()[0],
                                  pa.default_memory_pool())
        arr = result.to_array()
        pd_result = arr.to_numpy()
        return pd_result

    @staticmethod
    def gandiva_query(df, query):
        df_table = pa.Table.from_pandas(df)
        llvm_gandiva_visitor = LLVMGandivaVisitor(df_table)
        mod_f = ast.parse(query)
        llvm_mod = llvm_gandiva_visitor.visit(mod_f)
        results = llvm_gandiva_visitor.evaluate_filter(llvm_mod)
        return results

As you can see, the code is really simple since I'm not supporting every possible Python expression, only a minor subset of it. What we do in this class is basically convert the Python AST nodes such as Comparisons and BinOps (binary operations) into Gandiva nodes. I'm also changing the semantics of the & and | operators to represent AND and OR respectively, as in the Pandas query() function.

Registering it as a Pandas extension

The next step is to create a simple Pandas extension using the gandiva_query() method that we created:

@pd.api.extensions.register_dataframe_accessor("gandiva")
class GandivaAcessor:
    def __init__(self, pandas_obj):
        self.pandas_obj = pandas_obj

    def query(self, query):
        return LLVMGandivaVisitor.gandiva_query(self.pandas_obj, query)

And that is it, now we can use this extension to do things such as:

df = pd.DataFrame({"a": [1.0 * i for i in range(nsize)]}) results = df.gandiva.query("a > 10.0")

As you can see, we registered a Pandas extension called gandiva that is now a first-class citizen of the Pandas DataFrames.

Let's now create a DataFrame with 50 million floats and filter it using the new query() method:

df = pd.DataFrame({"a": [1.0 * i for i in range(50000000)]}) df.gandiva.query("a < 4.0") # This will output: # array([0, 1, 2, 3], dtype=uint32)

Note that the returned values are the indexes satisfying the condition we implemented, so this differs from the Pandas query(), which returns the data already filtered.
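Since the result is an index array, getting the same behavior as the Pandas query() is just one extra indexing step, for example:

indexes = df.gandiva.query("a < 4.0")
filtered_df = df.iloc[indexes]  # materialize the filtered rows, as df.query() would return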

I did some benchmarks and found that Gandiva is usually faster than Pandas; however, I'll leave proper benchmarks for a future post on Gandiva, as this post was meant to show how you can use it to JIT expressions.

That's it! I hope you liked the post, as I enjoyed exploring Gandiva. It seems that we will probably have more and more tools coming up with Gandiva acceleration, especially for SQL parsing/projection/JITing. Gandiva is much more than what I just showed, but you can get started now to understand more of its architecture and how to build the expression trees.

– Christian S. Perone

Cite this article as: Christian S. Perone, "Gandiva, using LLVM and Arrow to JIT and evaluate Pandas expressions," in Terra Incognita, 19/01/2020, //www.cpetem.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/

Listening to the neural network gradient norms during training

Training neural networks is often done by measuring many different metrics such as accuracy, loss, gradients, etc. Most of the time this is done by aggregating these metrics and plotting visualizations on TensorBoard.

There are, however, other senses that we can use to monitor the training of neural networks, such as sound. Sound is one of the perspectives that is currently very poorly explored in the training of neural networks. Human hearing can be very good at distinguishing very small perturbations in characteristics such as rhythm and pitch, even when these perturbations are very short in time or subtle.

For this experiment, I made a very simple example that synthesizes sound from the gradient norm of each layer and for each training step of a convolutional neural network training on MNIST, using different settings such as different learning rates, optimizers, momentum, etc.

You'll need to install PyAudio and PyTorch to run the code (which is at the end of this post).

Training sound with SGD using LR 0.01

This segment represents a training session with gradients from 4 layers during the first 200 steps of the first epoch, using a batch size of 10. The higher the pitch, the higher the norm for a layer; there is a short silence to indicate different batches. Note the gradients increasing over time.

Training sound with SGD using LR 0.1

Same as above, but with a higher learning rate.

Training sound with SGD using LR 1.0

Same as above, but with a high learning rate that makes the network diverge; pay attention to the high pitch when the norms explode and then the divergence.

Training sound with SGD using LR 1.0 and BS 256

Same settings, but with a high learning rate of 1.0 and a batch size of 256. Note how the gradients explode and then there are NaNs, which cause the final sound.

Training sound with Adam using LR 0.01

This is using Adam in the same setting as the SGD.

Source code

For those who are interested, here is the entire source code I used to make the sound clips:

import pyaudio
import numpy as np
import wave

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

        self.ordered_layers = [self.conv1,
                               self.conv2,
                               self.fc1,
                               self.fc2]

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def open_stream(fs):
    p = pyaudio.PyAudio()
    stream = p.open(format=pyaudio.paFloat32,
                    channels=1,
                    rate=fs,
                    output=True)
    return p, stream


def generate_tone(fs, freq, duration):
    npsin = np.sin(2 * np.pi * np.arange(fs*duration) * freq / fs)
    samples = npsin.astype(np.float32)
    return 0.1 * samples


def train(model, device, train_loader, optimizer, epoch):
    model.train()

    fs = 44100
    duration = 0.01
    f = 200.0
    p, stream = open_stream(fs)

    frames = []

    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()

        norms = []
        for layer in model.ordered_layers:
            norm_grad = layer.weight.grad.norm()
            norms.append(norm_grad)

            tone = f + ((norm_grad.numpy()) * 100.0)
            tone = tone.astype(np.float32)
            samples = generate_tone(fs, tone, duration)

            frames.append(samples)

        silence = np.zeros(samples.shape[0] * 2,
                           dtype=np.float32)
        frames.append(silence)

        optimizer.step()

        # Just 200 steps per epoch
        if batch_idx == 200:
            break

    wf = wave.open("sgd_lr_1_0_bs256.wav", 'wb')
    wf.setnchannels(1)
    wf.setsampwidth(p.get_sample_size(pyaudio.paFloat32))
    wf.setframerate(fs)
    wf.writeframes(b''.join(frames))
    wf.close()

    stream.stop_stream()
    stream.close()
    p.terminate()


def run_main():
    device = torch.device("cpu")

    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=256, shuffle=True)

    model = Net().to(device)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    for epoch in range(1, 2):
        train(model, device, train_loader, optimizer, epoch)


if __name__ == "__main__":
    run_main()
Cite this article as: Christian S. Perone, "Listening to the neural network gradient norms during training," inTerra Incognita, 04/08/2019,//www.cpetem.com/2019/08/listening-to-the-neural-network-gradient-norms-during-training/

NumPy dispatcher: when NumPy becomes a protocol for an ecosystem

Introduction

Not many people working in the Python scientific ecosystem are aware of NEP 18 (the dispatch mechanism for NumPy's high-level array functions). Given the importance of this protocol, I decided to write this short introduction to the new dispatcher, which will certainly bring a lot of benefits to the Python scientific ecosystem.

If you use PyTorch, TensorFlow, Dask, etc., you have certainly noticed the similarity of their APIs to the NumPy API. It is not by accident: the NumPy API is one of the most fundamental and widely-used APIs for scientific computing. NumPy is so pervasive that it ceased to be only an API and is becoming more of a protocol or an API specification.
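Just to give a feeling of what the protocol looks like, here is a minimal toy sketch of the __array_function__ protocol defined by NEP 18 (available since NumPy 1.17); the DiagonalArray class below is only an illustrative container, not a real library class:

import numpy as np

class DiagonalArray:
    """Toy duck array: an n-by-n matrix with `value` on the diagonal."""
    def __init__(self, n, value):
        self._n = n
        self._value = value

    def __array_function__(self, func, types, args, kwargs):
        # NumPy dispatches high-level API calls (np.sum, np.mean, ...) here,
        # letting this class provide its own implementation.
        if func is np.sum:
            return self._n * self._value
        return NotImplemented

arr = DiagonalArray(5, 2.0)
print(np.sum(arr))  # 10.0, computed by DiagonalArray instead of NumPy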

Read more

Randomized prior functions in PyTorch

Trained MLP with 2 hidden layers and a sine prior.

I experimented with the method described in "Randomized Prior Functions for Deep Reinforcement Learning" by Ian Osband et al. at NeurIPS 2018, where they devised a very simple and practical approach for uncertainty using bootstrap and randomized priors, and decided to share the PyTorch code.

I really like bootstrap methods; in my opinion, they are usually the easiest methods to implement and they provide very good posterior approximations, with deep connections to Bayesian approaches, without having to deal with variational inference. They actually show in the paper that, in the linear case, the method provides a Bayesian posterior.

The main idea of the method is to have bootstrap to provide a non-parametric data perturbation together with randomized priors, which are nothing more than just random initialized networks.

$$Q_{\theta_k}(x) = f_{\theta_k}(x) + p_k(x)$$

The final model \(Q_{\theta_k}(x)\) will be the k model of the ensemble that will fit the function \(f_{\theta_k}(x)\) with an untrained prior \(p_k(x)\).

Let's go to the code. The first class is a simple MLP with 2 hidden layers and Glorot initialization:

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(1, 20)
        self.l2 = nn.Linear(20, 20)
        self.l3 = nn.Linear(20, 1)

        nn.init.xavier_uniform_(self.l1.weight)
        nn.init.xavier_uniform_(self.l2.weight)
        nn.init.xavier_uniform_(self.l3.weight)

    def forward(self, inputs):
        x = self.l1(inputs)
        x = nn.functional.selu(x)
        x = self.l2(x)
        x = nn.functional.selu(x)
        x = self.l3(x)
        return x

Then later we define a class that will take the model and the prior to produce the final model result:

class ModelWithPrior(nn.Module):
    def __init__(self,
                 base_model: nn.Module,
                 prior_model: nn.Module,
                 prior_scale: float = 1.0):
        super().__init__()
        self.base_model = base_model
        self.prior_model = prior_model
        self.prior_scale = prior_scale

    def forward(self, inputs):
        with torch.no_grad():
            prior_out = self.prior_model(inputs)
            prior_out = prior_out.detach()
        model_out = self.base_model(inputs)
        return model_out + (self.prior_scale * prior_out)

And it's basically that! As you can see, it's a very simple method; in the second part we just created a custom forward() to avoid computing/accumulating gradients for the prior network and then summing it (after scaling) with the model prediction.

To train it, you just have to use a different bootstrap for each ensemble model, like in the code below:

def train_model(x_train, y_train, base_model, prior_model):
    model = ModelWithPrior(base_model, prior_model, 1.0)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

    for epoch in range(100):
        model.train()
        preds = model(x_train)
        loss = loss_fn(preds, y_train)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return model

and using a sampler with replacement (bootstrap) as in:

dataset = TensorDataset(...)
bootstrap_sampler = RandomSampler(dataset, True, len(dataset))
train_dataloader = DataLoader(dataset,
                              batch_size=len(dataset),
                              sampler=bootstrap_sampler)
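Putting the pieces together, a sketch of how the ensemble can be built and its uncertainty summarized could look like the code below; the helper names make_ensemble and predict_with_uncertainty are just illustrative, not part of the original code:

import torch

def make_ensemble(x_train, y_train, k=50):
    models = []
    n = x_train.shape[0]
    for _ in range(k):
        # Bootstrap: each ensemble member is trained on n points sampled with replacement
        idx = torch.randint(0, n, (n,))
        base_model = MLP()    # trainable network
        prior_model = MLP()   # untrained, randomly-initialized prior
        models.append(train_model(x_train[idx], y_train[idx],
                                  base_model, prior_model))
    return models

def predict_with_uncertainty(models, x):
    # The spread of the ensemble predictions is used as the uncertainty estimate
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])
    return preds.mean(dim=0), preds.std(dim=0)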

In this case, I used the same small dataset used in the original paper:

After training it with a simple MLP prior as well, the results for the uncertainty are shown below:

Model trained with an MLP prior, using an ensemble of 50 models.

If we look at just the priors, we will see the variation of the untrained networks:

We can also visualize the individual model predictions showing their variability due to different initializations as well as the bootstrap noise:

Plot showing each individual model prediction in red, together with the real data.

Now, what is also quite interesting, is that we can change the prior to let’s say a fixed sine:

class SinPrior(nn.Module):
    def forward(self, inputs):
        return torch.sin(3 * inputs)

Then, when we train the same MLP model but this time using the sine prior, we can see how it affects the final prediction and uncertainty bounds:

If we show each individual model, we can see the effect of the prior contribution to each individual model:

Plot showing each individual model of the ensemble trained with a sine prior.

I hope you liked it; it is a very simple approach with quite amazing results that at least passes the linear "sanity check". I'll explore replacing the prior with some pre-trained networks to see the different effects on predictions; it is a very interesting way to add some simple priors.

Cite this article as: Christian S. Perone, "Randomized prior functions in PyTorch," inTerra Incognita, 24/03/2019,//www.cpetem.com/2019/03/randomized-prior-functions-in-pytorch/

Bayesian analysis of the ENEM microdata

* This post is in Portuguese. It's a Bayesian analysis of a Brazilian national exam. The main focus of the analysis is to understand the underlying factors impacting the participants' performance on the ENEM.

Este tutorial apresenta uma análise breve dos microdados do ENEM do Rio Grande do Sul do ano de 2017. O principal objetivo é entender os fatores que impactam na performance dos participantes do ENEM dado fatores como renda familiar e tipo de escola. Neste tutorial são apresentados dois modelos: regressão linear e regressão linear hierárquica.


PyTorch 1.0 tracing JIT and LibTorch C++ API to integrate PyTorch into NodeJS

Update 28 Feb 2019: I added a new blog post with a slide deck covering the presentation I did for PyData Montreal.

Today, at the PyTorch Developer Conference, the PyTorch team announced the plans and the release of the PyTorch 1.0 preview with many nice features such as a JIT for model graphs (with and without tracing) as well as LibTorch, the PyTorch C++ API, one of the most important release announcements made today in my opinion.

Given the huge interest in understanding how this new API works, I decided to write this article showing an example of the many opportunities that are now open after the release of the PyTorch C++ API. In this post, I'll integrate PyTorch inference into native NodeJS using a NodeJS C++ add-on, just as an example of the tight integration between different frameworks/languages that is now possible using the C++ API.

Below you can see the final result:

As you can see, the integration is seamless and I could use a traced ResNet as the computational graph model and feed any tensor to it to get the output predictions.

Introduction

Simply put, libtorch is a library version of PyTorch. It contains the underlying foundation that is used by PyTorch, such as ATen (the tensor library), which contains all the tensor operations and methods. Libtorch also contains the autograd, which is the component that adds the automatic differentiation to the ATen tensors.

A word of caution for those who are starting now is to be careful with the use of the tensors that can be created both from ATen and autograd — do not mix them. ATen will return plain tensors (when you create them using the at namespace) while the autograd functions (from the torch namespace) will return Variable, with its automatic differentiation mechanism added.

For a more extensive tutorial on how PyTorch works internally, please take a look at my previous tutorial on the PyTorch internal architecture.

Libtorch can be downloaded from the PyTorch website and it is only available as a preview for a while. You can also find the documentation on this site, which is mostly a Doxygen-rendered documentation. I found the library pretty stable, and it makes sense since it is actually exposing the stable foundations of PyTorch; however, there are some issues with headers and some minor problems concerning the organization of the library that you might find while starting to work with it (which will hopefully be fixed soon).

For NodeJS, I'll use the Native Abstractions library (nan), which is the most recommended library (actually basically a header-only library) to create NodeJS C++ add-ons, and cmake-js, because libtorch already provides the cmake files that make our building process much easier. However, the focus here will be on the C++ code and not on the building process.

The flow for the development, tracing, serializing and loading the model can be seen in the figure on the left side.

It starts with the development process and tracing being done in PyTorch (Python domain) and then the loading and inference on the C++ domain (in our case in NodeJS add-on).

Wrapping the Tensor

In NodeJS, to create an object as a first-class citizen of the JavaScript world, you need to inherit from the ObjectWrap class, which will be responsible for wrapping a C++ component.

#ifndef TENSOR_H
#define TENSOR_H

#include <nan.h>
#include <torch/torch.h>

namespace torchjs {

class Tensor : public Nan::ObjectWrap {
 public:
  static NAN_MODULE_INIT(Init);

  void setTensor(at::Tensor tensor) {
    this->mTensor = tensor;
  }

  torch::Tensor getTensor() {
    return this->mTensor;
  }

  static v8::Local<v8::Object> NewInstance();

 private:
  explicit Tensor();
  ~Tensor();

  static NAN_METHOD(New);
  static NAN_METHOD(toString);
  static Nan::Persistent<v8::Function> constructor;

 private:
  torch::Tensor mTensor;
};

}  // namespace torchjs

#endif

As you can see, most of the code for our Tensor class definition is just boilerplate. The key point here is that the torchjs::Tensor will wrap a torch::Tensor, and we added two special public methods (setTensor and getTensor) to set and get this internal torch tensor.

I won't show all the implementation details because most parts of it are NodeJS boilerplate code to construct the object, etc. I'll focus on the parts that touch the libtorch API, like in the code below where we are creating a small textual representation of the tensor to show on JavaScript (the toString method):

NAN_METHOD(Tensor::toString) {
  Tensor* obj = ObjectWrap::Unwrap<Tensor>(info.Holder());
  std::stringstream ss;

  at::IntList sizes = obj->mTensor.sizes();
  ss << "Tensor[Type=" << obj->mTensor.type() << ", ";
  ss << "Size=" << sizes << std::endl;

  info.GetReturnValue().Set(Nan::New(ss.str()).ToLocalChecked());
}

What we are doing in the code above is just getting the internal tensor object from the wrapped object by unwrapping it. After that, we build a string representation with the tensor sizes (the size of each dimension) and its type (float, etc.).

Wrapping tensor-creation operations

Let's now create the wrapping code for the torch::ones function, which is responsible for creating a tensor of any defined shape filled with constant 1 entries.

NAN_METHOD(ones) {
  // Sanity checking of the arguments
  if (info.Length() < 2)
    return Nan::ThrowError(Nan::New("Wrong number of arguments").ToLocalChecked());

  if (!info[0]->IsArray() || !info[1]->IsBoolean())
    return Nan::ThrowError(Nan::New("Wrong argument types").ToLocalChecked());

  // Retrieving parameters (require_grad and the tensor shape)
  const bool require_grad = info[1]->BooleanValue();
  const v8::Local<v8::Array> array = info[0].As<v8::Array>();
  const uint32_t length = array->Length();

  // Converting from v8::Array to std::vector
  std::vector<long long> dims;
  for (int i = 0; i < length; i++) {
    v8::Local<v8::Value> v;
    int d = array->Get(i)->NumberValue();
    dims.push_back(d);
  }

  // Call libtorch and create a new torchjs::Tensor object
  // wrapping the new torch::Tensor that was created by torch::ones
  at::Tensor v = torch::ones(dims, torch::requires_grad(require_grad));
  auto newinst = Tensor::NewInstance();
  Tensor* obj = Nan::ObjectWrap::Unwrap<Tensor>(newinst);
  obj->setTensor(v);

  info.GetReturnValue().Set(newinst);
}

So, let's go through this code. We are first checking the arguments of the function. For this function, we expect a tuple (a JavaScript array) for the tensor shape and a boolean indicating whether we want to compute gradients or not for this tensor node. After that, we convert the parameters from the V8 JavaScript types into native C++ types. Once we have the required parameters, we then call the torch::ones function from libtorch; this function will create a new tensor, which we wrap using the torchjs::Tensor class that we created earlier.

And that's it, we just exposed one torch operation that can be used as a native JavaScript operation.

Intermezzo for the PyTorch JIT

The introduced PyTorch JIT revolves around the concept of the Torch Script. A Torch Script is a restricted subset of the Python language and comes with its own compiler and transform passes (optimizations, etc).

This script can be created in two different ways: by using a tracing JIT or by providing the script itself. In the tracing mode, your computational graph nodes will be visited and operations recorded to produce the final script, while scripting is the mode where you provide a description of your model taking into account the restrictions of the Torch Script.

Note that if you have branching decisions on your code that depends on external factors or data, tracing won’t work as you expect because it will record that particular execution of the graph, hence the alternative option to provide the script. However, in most of the cases, the tracing is what we need.

To understand the differences, let’s take a look at the Intermediate Representation (IR) from the script module generated both by tracing and by scripting.

@torch.jit.script
def happy_function_script(x):
    ret = torch.rand(0)
    if True == True:
        ret = torch.rand(1)
    else:
        ret = torch.rand(2)
    return ret

def happy_function_trace(x):
    ret = torch.rand(0)
    if True == True:
        ret = torch.rand(1)
    else:
        ret = torch.rand(2)
    return ret

traced_fn = torch.jit.trace(happy_function_trace,
                            (torch.tensor(0),),
                            check_trace=False)

In the code above, we are providing two functions: one is using the @torch.jit.script decorator, and it is the scripting way to create a Torch Script, while the second function is being used by the tracing function torch.jit.trace. Note that I intentionally added a "True == True" decision on the functions (which will always be true).

Now, if we inspect the IR generated by these two different approaches, we'll clearly see the difference between the tracing and scripting approaches:

# 1) Graph from the scripting approach
graph(%x : Dynamic) {
  %16 : int = prim::Constant[value=2]()
  %10 : int = prim::Constant[value=1]()
  %7 : int = prim::Constant[value=1]()
  %8 : int = prim::Constant[value=1]()
  %9 : int = aten::eq(%7, %8)
  %ret : Dynamic = prim::If(%9)
    block0() {
      %11 : int[] = prim::ListConstruct(%10)
      %12 : int = prim::Constant[value=6]()
      %13 : int = prim::Constant[value=0]()
      %14 : int[] = prim::Constant[value=[0, -1]]()
      %ret.2 : Dynamic = aten::rand(%11, %12, %13, %14)
      -> (%ret.2)
    }
    block1() {
      %17 : int[] = prim::ListConstruct(%16)
      %18 : int = prim::Constant[value=6]()
      %19 : int = prim::Constant[value=0]()
      %20 : int[] = prim::Constant[value=[0, -1]]()
      %ret.3 : Dynamic = aten::rand(%17, %18, %19, %20)
      -> (%ret.3)
    }
  return (%ret);
}

# 2) Graph from the tracing approach
graph(%0 : Long()) {
  %7 : int = prim::Constant[value=1]()
  %8 : int[] = prim::ListConstruct(%7)
  %9 : int = prim::Constant[value=6]()
  %10 : int = prim::Constant[value=0]()
  %11 : int[] = prim::Constant[value=[0, -1]]()
  %12 : Float(1) = aten::rand(%8, %9, %10, %11)
  return (%12);
}

As we can see, the IR is very similar to LLVM IR. Note that in the tracing approach, the trace recorded only one path of the code, the path of truth, while in scripting we have both branching alternatives. However, even in scripting, the always-false branch can be optimized away with a dead code elimination transform pass.

The PyTorch JIT has a lot of transform passes that are used to do loop unrolling, dead code elimination, etc. You can find these passes here. Conversion to other formats such as ONNX can be implemented as a pass on top of this intermediate representation (IR), which is quite convenient.

Tracing the ResNet

Now, before implementing the Script Module in NodeJS, let’s first trace a ResNet network using PyTorch (using just Python):

traced_net = torch.jit.trace(torchvision.models.resnet18(),
                             torch.rand(1, 3, 224, 224))
traced_net.save("resnet18_trace.pt")

As you can see from the code above, we just have to provide a tensor example (in this case, a batch of a single image with 3 channels and size 224×224). After that, we just save the traced network into a file called resnet18_trace.pt.
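Before moving to NodeJS, a quick sanity check of the traced file can be done from Python itself using the standard torch.jit API (this little snippet is just an illustration):

import torch

# Load the traced ScriptModule back and run a forward pass on a random batch
loaded = torch.jit.load("resnet18_trace.pt")
output = loaded(torch.rand(1, 3, 224, 224))
print(output.shape)  # torch.Size([1, 1000]) - the ImageNet logits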

Now we are ready to implement the Script Module in NodeJS in order to load this traced file.

Wrapping the Script Module

This is now the implementation of the Script Module in NodeJS:

// Class constructor
ScriptModule::ScriptModule(const std::string filename) {
  // Load the traced network from the file
  this->mModule = torch::jit::load(filename);
}

// JavaScript object creation
NAN_METHOD(ScriptModule::New) {
  if (info.IsConstructCall()) {
    // Get the filename parameter
    v8::String::Utf8Value param_filename(info[0]->ToString());
    const std::string filename = std::string(*param_filename);

    // Create a new script module using that file name
    ScriptModule* obj = new ScriptModule(filename);
    obj->Wrap(info.This());
    info.GetReturnValue().Set(info.This());
  } else {
    v8::Local<v8::Function> cons = Nan::New(constructor);
    info.GetReturnValue().Set(Nan::NewInstance(cons).ToLocalChecked());
  }
}

As you can see from the code above, we are just creating a class that will call the torch::jit::load function, passing the file name of the traced network. We also have the implementation of the JavaScript object, where we convert parameters to C++ types and then create a new instance of the torchjs::ScriptModule.

The wrapping of the forward pass is also quite straightforward:

NAN_METHOD(ScriptModule::forward) {
  ScriptModule* script_module = ObjectWrap::Unwrap<ScriptModule>(info.Holder());

  Nan::MaybeLocal<v8::Object> maybe = Nan::To<v8::Object>(info[0]);
  Tensor* tensor = Nan::ObjectWrap::Unwrap<Tensor>(maybe.ToLocalChecked());

  torch::Tensor torch_tensor = tensor->getTensor();
  torch::Tensor output = script_module->mModule->forward({torch_tensor}).toTensor();

  auto newinst = Tensor::NewInstance();
  Tensor* obj = Nan::ObjectWrap::Unwrap<Tensor>(newinst);
  obj->setTensor(output);

  info.GetReturnValue().Set(newinst);
}

As you can see, in this code we just receive a tensor as an argument, we get the internal torch::Tensor from it and then call the forward method from the script module, wrap the output in a new torchjs::Tensor and then return it.

And that's it, we are ready to use our built module natively in NodeJS, as in the example below:

var torchjs = require("./build/Release/torchjs");
var script_module = new torchjs.ScriptModule("resnet18_trace.pt");
var data = torchjs.ones([1, 3, 224, 224], false);
var output = script_module.forward(data);

I hope you liked it! Libtorch opens the door to the tight integration of PyTorch in many different languages and frameworks, which is quite exciting and a huge step towards production deployment code.

– Christian S. Perone

Cite this article as: Christian S. Perone, "PyTorch 1.0 tracing JIT and LibTorch C++ API to integrate PyTorch into NodeJS," in Terra Incognita, 02/10/2018.

PyTorch – Internal Architecture Tour

Update 28 Feb 2019: I added a new blog post with a slide deck covering the presentation I did for PyData Montreal.

Introduction

This post is a tour around the PyTorch codebase; it is meant to be a guide for the architectural design of PyTorch and its internals. My main goal is to provide something useful for those who are interested in understanding what happens beyond the user-facing API, and to show something new beyond what was already covered in other tutorials.

Note: the PyTorch build system uses code generation extensively, so I won't repeat here what was already described by others. If you're interested in understanding how this works, please read the following tutorials:

Short intro to Python extension objects in C/C++

As you probably know, you can extend Python using C and C++ and develop what is called an "extension". All the PyTorch heavy work is implemented in C/C++ instead of pure Python. To define a new Python object type in C/C++, you define a structure like the example below (which is the base for the autograd Variable class):

// Python object that backs torch.autograd.Variable
struct THPVariable {
    PyObject_HEAD
    torch::autograd::Variable cdata;
    PyObject* backward_hooks;
};

As you can see, there is a macro at the beginning of the definition, called PyObject_HEAD; this macro's goal is the standardization of Python objects and it will expand to another structure that contains a pointer to a type object (which defines initialization methods, allocators, etc.) and also a field with a reference counter.

There are two extra macros in the Python API called Py_INCREF() and Py_DECREF(), which are used to increment and decrement the reference counter of Python objects. Multiple entities can borrow or own a reference to other objects (the reference counter is increased), and only when this reference counter reaches zero (when all references get destroyed) will Python automatically delete the memory of that object using its garbage collector.

You can read more about Python C/++ extensions here.

Funny fact: it is very common in many applications to use small integer numbers as indexes, counters, etc. For efficiency, the official CPython interpreter caches the integers from -5 up to 256. For that reason, the statement a = 200; b = 200; a is b will be True, while the statement a = 300; b = 300; a is b will be False.
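You can check this behavior in a CPython interactive session (entering each statement separately, since constants compiled together on the same line may be merged):

>>> a = 200
>>> b = 200
>>> a is b
True
>>> a = 300
>>> b = 300
>>> a is b
False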

Zero-copy PyTorch Tensor to Numpy and vice-versa

PyTorch has its own Tensor representation, which decouples the PyTorch internal representation from external representations. However, since it is very common, especially when data is loaded from a variety of sources, to have Numpy arrays everywhere, we really need to make conversions between Numpy and PyTorch tensors. For that reason, PyTorch provides two methods called from_numpy() and numpy(), which convert a Numpy array to a PyTorch array and vice-versa, respectively. If we look at the code that is called to convert a Numpy array into a PyTorch tensor, we can get more insights into PyTorch's internal representation:

at::Tensor tensor_from_numpy(PyObject* obj) {
  if (!PyArray_Check(obj)) {
    throw TypeError("expected np.ndarray (got %s)", Py_TYPE(obj)->tp_name);
  }

  auto array = (PyArrayObject*)obj;
  int ndim = PyArray_NDIM(array);
  auto sizes = to_aten_shape(ndim, PyArray_DIMS(array));
  auto strides = to_aten_shape(ndim, PyArray_STRIDES(array));

  // NumPy strides use bytes. Torch strides use element counts.
  auto element_size_in_bytes = PyArray_ITEMSIZE(array);
  for (auto& stride : strides) {
    stride /= element_size_in_bytes;
  }

  // (...) - omitted for brevity

  void* data_ptr = PyArray_DATA(array);
  auto& type = CPU(dtype_to_aten(PyArray_TYPE(array)));
  Py_INCREF(obj);
  return type.tensorFromBlob(data_ptr, sizes, strides, [obj](void* data) {
    AutoGIL gil;
    Py_DECREF(obj);
  });
}

(code from tensor_numpy.cpp)

As you can see from this code, PyTorch is obtaining all the information (array metadata) from the Numpy representation and then creating its own. However, as you can note in the line where data_ptr is obtained with PyArray_DATA(), PyTorch is getting a pointer to the internal Numpy array raw data instead of copying it. This means that PyTorch will create a reference to this data, sharing the same memory region with the Numpy array object for the raw Tensor data.

There is also an important point here: when the Numpy array object goes out of scope and gets a zero reference count, it will be garbage collected and destroyed; that's why there is an increment of the reference counting of the Numpy array object with the Py_INCREF(obj) call.

After this, PyTorch will create a new Tensor object from this Numpy data blob, and in the creation of this new Tensor it passes the borrowed memory data pointer, together with the memory size and strides as well as a function that will be used later by the Tensor Storage (we’ll discuss this in the next section) to release the data by decrementing the reference counting to the Numpy array object and let Python take care of this object life cycle.

The tensorFromBlob() method will create a new Tensor, but only after creating a new "Storage" for this Tensor. The storage is where the actual data pointer will be stored (and not in the Tensor structure itself). This takes us to the next section about Tensor Storages.

Tensor Storage

The actual raw data of the Tensor is not directly kept in the Tensor structure, but on another structure called Storage, which in turn is part of the Tensor structure.

As we saw in the previous code from tensor_from_numpy(), there is a call to tensorFromBlob() that will create a Tensor from the raw data blob. This last function will call another function called storageFromBlob() that will, in turn, create a storage for this data according to its type. In the case of a CPU float type, it will return a new CPUFloatStorage instance.

The CPUFloatStorage is basically a wrapper with utility functions around the actual storage structure called THFloatStorage, which we show below:

typedef struct THStorage
{
    real *data;
    ptrdiff_t size;
    int refcount;
    char flag;
    THAllocator *allocator;
    void *allocatorContext;
    struct THStorage *view;
} THStorage;

(code from THStorage.h)

As you can see, the THStorage holds a pointer to the raw data, its size, flags and also an interesting field called allocator that we'll soon discuss. It is also important to note that there is no metadata about how to interpret the data inside the THStorage; this is due to the fact that the storage is "dumb" regarding its contents and it is the Tensor's responsibility to know how to "view" or interpret this data.

From this, you have probably realized that we can have multiple tensors pointing to the same storage but with different views of this data, and that's why viewing a tensor with a different shape (but keeping the same number of elements) is so efficient. The Python code below shows that the data pointer in the storage is shared after changing the way the Tensor views its data:

>>> tensor_a = torch.ones((3, 3))
>>> tensor_b = tensor_a.view(9)
>>> tensor_a.storage().data_ptr() == tensor_b.storage().data_ptr()
True

As we can see in the example above, the data pointer on the storage of both Tensors are the same, but the Tensors represent a different interpretation of the storage data.

Now, as we saw in the seventh line of the THStorage structure, there is a pointer to a THAllocator structure there. And this is very important because it brings flexibility regarding the allocator that can be used to allocate the storage data. This structure is represented by the following code:

typedef struct THAllocator
{
  void* (*malloc)(void*, ptrdiff_t);
  void* (*realloc)(void*, void*, ptrdiff_t);
  void (*free)(void*, void*);
} THAllocator;

(code from THAllocator.h)

As you can see, there are three function pointer fields in this structure to define what an allocator means: a malloc, realloc and free. For CPU-allocated memory, these functions will, of course, relate to the traditional malloc/realloc/free POSIX functions; however, when we want a storage allocated on GPUs we'll end up using the CUDA allocators such as cudaMallocHost(), as we can see in the THCudaHostAllocator malloc function below:

static void *THCudaHostAllocator_malloc(void* ctx, ptrdiff_t size) {
  void* ptr;

  if (size < 0) THError("Invalid memory size: %ld", size);

  if (size == 0) return NULL;

  THCudaCheck(cudaMallocHost(&ptr, size));

  return ptr;
}

(code from THCAllocator.c)

You probably noticed a pattern in the repository organization, but it is important to keep in mind these conventions when navigating the repository, as summarized here (taken from the PyTorch lib README):

  • TH = TorcH
  • THC = TorcH Cuda
  • THCS = TorcH Cuda Sparse
  • THCUNN = TorcH CUda Neural Network
  • THD = TorcH Distributed
  • THNN = TorcH Neural Network
  • THS = TorcH Sparse

This convention is also present in the function/class names and other objects, so it is important to always keep these patterns in mind. While you can find CPU allocators in the TH code, you'll find CUDA allocators in the THC code.

Finally, we can see the composition of the main Tensor THTensor structure:

typedef struct THTensor
{
    int64_t *size;
    int64_t *stride;
    int nDimension;

    THStorage *storage;
    ptrdiff_t storageOffset;
    int refcount;
    char flag;
} THTensor;

(code from THTensor.h)

As you can see, the main THTensor structure holds the sizes/strides/dimensions/offsets/etc. as well as the storage (THStorage) for the Tensor data.

We can summarize all this structure that we saw in the diagram below:

Now, when we have requirements such as multiprocessing, where we want to share tensor data among multiple different processes, we need a shared memory approach to solve it. Otherwise, every time another process needs a tensor, or even when you want to implement a Hogwild training procedure where all the different processes write to the same memory region (where the parameters are), you would need to make copies between processes, and this is very inefficient. Therefore, we'll discuss in the next section a special kind of storage for Shared Memory.

Shared Memory

Shared memory can be implemented in many different ways depending on the platform support. PyTorch supports some of them, but for the sake of simplicity, I'll talk here about what happens on macOS using CPU (instead of GPU). Since PyTorch supports multiple shared memory approaches, this part is a little tricky to grasp, since it involves more levels of indirection in the code.

PyTorch provides a wrapper around the Python multiprocessing module, which can be imported from torch.multiprocessing. The changes they implemented in this wrapper around the official Python multiprocessing module were done to make sure that every time a tensor is put on a queue or shared with another process, PyTorch will make sure that only a handle for the shared memory will be shared instead of a new entire copy of the Tensor.
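As a minimal sketch of what this wrapper gives you, a tensor sent through a torch.multiprocessing queue ends up backed by shared memory, so an in-place update made by the child process is visible in the parent:

import torch
import torch.multiprocessing as mp

def worker(queue):
    t = queue.get()   # receives only a shared-memory handle, not a copy
    t.add_(1.0)       # in-place update on the shared storage

if __name__ == "__main__":
    queue = mp.Queue()
    tensor = torch.zeros(3, 3)
    p = mp.Process(target=worker, args=(queue,))
    p.start()
    queue.put(tensor)  # storage is moved to shared memory and its handle is sent
    p.join()
    print(tensor)      # all ones: the child's update is visible here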

Now, many people aren't aware of a Tensor method from PyTorch called share_memory_(); however, this function is what triggers an entire rebuild of the storage memory for that particular Tensor. What this method does is create a region of shared memory that can be used among different processes. This function will, in the end, call the following function below:

static THStorage* THPStorage_(newFilenameStorage)(ptrdiff_t size)
{
  int flags = TH_ALLOCATOR_MAPPED_SHAREDMEM | TH_ALLOCATOR_MAPPED_EXCLUSIVE;
  std::string handle = THPStorage_(__newHandle)();
  auto ctx = libshm_context_new(NULL, handle.c_str(), flags);
  return THStorage_(newWithAllocator)(size, &THManagedSharedAllocator, (void*)ctx);
}

(code from StorageSharing.cpp)

As you can see, this function will create another storage using a special allocator called THManagedSharedAllocator. It first defines some flags and then it creates a handle, which is a string in the format /torch_[process id]_[random number], and after that, it will create a new storage using the special THManagedSharedAllocator. This allocator has function pointers to an internal PyTorch library called libshm, which will implement a Unix Domain Socket communication to share the handles of the shared memory regions. This allocator is actually a special case, a kind of "smart allocator", because it contains the communication control logic and it uses another allocator called THRefcountedMapAllocator, which will be responsible for creating the actual shared memory region and calling mmap() to map this region to the process virtual address space.

Note: when a method ends with an underscore in PyTorch, such as the method called share_memory_(), it means that the method has an in-place effect and it will change the current object instead of creating a new one with the modifications.

I'll now show a Python example of one process using the data from a Tensor that was allocated on another process, by manually exchanging the shared memory handle:

This is executed in the process A:

>>> import torch
>>> tensor_a = torch.ones((5, 5))
>>> tensor_a

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

>>> tensor_a.is_shared()
False
>>> tensor_a = tensor_a.share_memory_()
>>> tensor_a.is_shared()
True
>>> tensor_a_storage = tensor_a.storage()
>>> tensor_a_storage._share_filename_()
(b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)

In this code, executed in process A, we create a new Tensor of 5×5 filled with ones. After that we make it shared and print the tuple with the Unix Domain Socket address as well as the handle. Now we can access this memory region from another process B as shown below:

Code executed in the process B:

>>> import torch
>>> tensor_a = torch.Tensor()
>>> tuple_info = (b'/var/tmp/tmp.0.yowqlr', b'/torch_31258_1218748506', 25)
>>> storage = torch.Storage._new_shared_filename(*tuple_info)
>>> tensor_a = torch.Tensor(storage).view((5, 5))

 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
 1  1  1  1  1
[torch.FloatTensor of size 5x5]

As you can see, using the tuple information about the Unix Domain Socket address and the handle, we are able to access the Tensor storage from another process. If you change this tensor in process B, you'll also see that it reflects in process A, because these Tensors are sharing the same memory region.

DLPack: a hope for the Deep Learning frameworks Babel

Now I would like to talk about something recent in the PyTorch code base, which is called DLPack. DLPack is an open standardization of an in-memory tensor structure that allows exchanging tensor data between frameworks, and what is quite interesting is that, since this memory representation is standardized and very similar to the memory representation already in use by many frameworks, it will allow zero-copy data sharing between frameworks, which is a quite amazing initiative given the variety of frameworks we have today without inter-communication among them.

This will certainly help to overcome the "island model" that we have today between tensor representations in MXNet, PyTorch, etc., and will allow developers to mix framework operations between frameworks, with all the benefits that a standardization can bring to the frameworks.

The core of DLPack is a very simple structure called DLTensor, as shown below:

/*!
 * \brief Plain C Tensor object, does not manage memory.
 */
typedef struct {
  /*!
   * \brief The opaque data pointer points to the allocated data.
   *  This will be CUDA device pointer or cl_mem handle in OpenCL.
   *  This pointer is always aligns to 256 bytes as in CUDA.
   */
  void* data;
  /*! \brief The device context of the tensor */
  DLContext ctx;
  /*! \brief Number of dimensions */
  int ndim;
  /*! \brief The data type of the pointer*/
  DLDataType dtype;
  /*! \brief The shape of the tensor */
  int64_t* shape;
  /*!
   * \brief strides of the tensor,
   *  can be NULL, indicating tensor is compact.
   */
  int64_t* strides;
  /*! \brief The offset in bytes to the beginning pointer to data */
  uint64_t byte_offset;
} DLTensor;

(code from dlpack.h)

As you can see, there is a data pointer for the raw data as well as shape/stride/offset/GPU vs CPU, and other metadata information about the data that the DLTensor is pointing to.

There is also a managed version of the tensor called DLManagedTensor, where the framework can provide a context and also a "deleter" function that can be called by the framework that borrowed the Tensor to inform the other framework that the resources are no longer required.

In PyTorch, if you want to convert to or from a DLTensor format, you can find both C/C++ methods for doing that or even in Python you can do that as shown below:

import torch
from torch.utils import dlpack

t = torch.ones((5, 5))
dl = dlpack.to_dlpack(t)

This Python function will call the toDLPack function from ATen, shown below:

DLManagedTensor* toDLPack(const Tensor& src) {
  ATenDLMTensor* atDLMTensor(new ATenDLMTensor);
  atDLMTensor->handle = src;
  atDLMTensor->tensor.manager_ctx = atDLMTensor;
  atDLMTensor->tensor.deleter = &deleter;
  atDLMTensor->tensor.dl_tensor.data = src.data_ptr();
  int64_t device_id = 0;
  if (src.type().is_cuda()) {
    device_id = src.get_device();
  }
  atDLMTensor->tensor.dl_tensor.ctx = getDLContext(src.type(), device_id);
  atDLMTensor->tensor.dl_tensor.ndim = src.dim();
  atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src.type());
  atDLMTensor->tensor.dl_tensor.shape =
      const_cast<int64_t*>(src.sizes().data());
  atDLMTensor->tensor.dl_tensor.strides =
      const_cast<int64_t*>(src.strides().data());
  atDLMTensor->tensor.dl_tensor.byte_offset = 0;
  return &(atDLMTensor->tensor);
}

As you can see, it's a pretty simple conversion, casting the metadata from the PyTorch format to the DLPack format and assigning a pointer to the internal Tensor data representation.
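Going the other way is just as simple; a quick round-trip sketch shows the zero-copy property, since both tensors end up pointing to the same memory:

from torch.utils import dlpack

# Convert the DLPack capsule created above back into a PyTorch tensor
t2 = dlpack.from_dlpack(dl)
print(t2.data_ptr() == t.data_ptr())  # True: no copy was made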

I really hope that more frameworks adopt this standard, which will certainly benefit the ecosystem. It is also interesting to note that a potential integration with Apache Arrow would be amazing.

That's it, I hope you liked this long post!

– Christian S. Perone

Cite this article as: Christian S. Perone, "PyTorch – Internal Architecture Tour," in Terra Incognita, 12/03/2018, //www.cpetem.com/2018/03/pytorch-internal-architecture-tour/

隐私保护使用InferSent的嵌入和安全两方计算句子语义相似度

Privacy-preserving Computation

Privacy-preserving computation or secure computation is a sub-field of cryptography where two (two-party, or 2PC) or multiple (multi-party, or MPC) parties can evaluate a function together without revealing information about the parties' private input data to each other. The problem and the first solution to it were introduced in 1982 by an amazing breakthrough done by Andrew Yao on what later became known as "Yao's Millionaires' problem".

Yao's Millionaires' problem is one where two millionaires, Alice and Bob, are interested in knowing which of them is richer, but without revealing their actual wealth to each other. In other words, what they want can be generalized as this: Alice and Bob want to jointly compute a function securely, without knowing anything other than the result of the computation on the input data (which remains private to them).

To make the problem concrete, Alice has an amount A, such as $10, and Bob has an amount B, such as $50, and what they want to know is which one is larger, without Bob revealing the amount B to Alice or Alice revealing the amount A to Bob. It is also important to note that we don't want to trust a third party, otherwise the problem would just be a simple protocol of information exchange with the trusted party.

Formally what we want is to jointly evaluate the following function:

r = f(A, B)

Such that the private values A and B are held private to their sole owner, and the result r will be known to just one or both of the parties.

It seems very counterintuitive that a problem like that could ever be solved, but to the surprise of many people, it is possible to solve it under some security requirements. Thanks to the recent developments in techniques such as FHE (Fully Homomorphic Encryption), Oblivious Transfer and Garbled Circuits, problems like that started to get practical for real-life usage, and they are nowadays being used by many companies in applications such as information exchange, secure location, advertisement, satellite orbit collision avoidance, etc.

I'm not going to go into the details of these techniques, but if you're interested in the intuition behind OT (Oblivious Transfer), you should definitely read the amazing explanation done by Craig Gidney here. There are also, of course, many different protocols for doing 2PC or MPC, where each one of them assumes some security requirements (semi-honest, malicious, etc.); I'm not going to enter into the details to keep the post focused on the goal, but you should be aware of that.

The problem: sentence similarity

What we want to achieve is to use privacy-preserving computation to calculate the similarity between sentences without disclosing the content of the sentences. Just to give a concrete example: Bob owns a company and has the descriptions of many different projects in sentences such as: "This project is about building a deep learning sentiment analysis framework that will be used for tweets", and Alice, who owns another competitor company, also has different projects described in similar sentences. What they want to do is to jointly compute the similarity between projects in order to find out if they should be doing a partnership on a project or not. However, and this is the important point: Bob doesn't want Alice to know the project descriptions and neither does Alice want Bob to be aware of their projects; they want to know the closest match between the different projects they run, but without disclosing the project ideas (project descriptions).

Sentence similarity comparison

Now, how can we exchange information about Alice's and Bob's project sentences without disclosing information about the project descriptions?

One naive way to do that would be to just compute hashes of the sentences and then compare only the hashes to check if they match. However, this would assume that the descriptions are exactly the same, and besides that, if the entropy of the sentences is small (like short sentences), someone with reasonable computational power could try to recover the sentence.

Another approach to this problem (and this is the approach that we'll be using) is to compare the sentences in the sentence embedding space. We just need to create sentence embeddings using a Machine Learning model (we'll use InferSent later) and then compare the embeddings of the sentences. However, this approach also raises another concern: what if Bob or Alice trains a Seq2Seq model that would go from the embeddings of the other party back to an approximate description of the project?

It isn't unreasonable to think that one could recover an approximate description of a sentence given its embeddings. That's why we'll use two-party secure computation for computing the embedding similarity, in such a way that Alice and Bob will compute the similarity of the embeddings without revealing their embeddings, keeping their project ideas safe.

The entire flow is shown in the image below, where Alice and Bob share the same Machine Learning model; after that, they use this model to go from sentences to embeddings, followed by a secure computation of the similarity in the embedding space.

Diagram overview of the entire process.

Generating sentence embeddings with InferSent

Bi-LSTM max-pooling network. Source: Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. Alexis Conneau et al.

InferSent is an NLP technique for universal sentence representation developed by Facebook that uses supervised training to produce highly transferable representations.

They used a Bi-directional LSTM with attention that consistently surpassed many unsupervised training methods such as the SkipThought vectors. They also provide a Pytorch implementation that we'll use to generate the sentence embeddings.

Note: even if you don't have a GPU, you can have reasonable performance doing embeddings for a few sentences.

The first step to generate the sentence embeddings is to download and load the pre-trained InferSent model:

import numpy as np
import torch

# Trained model from: https://github.com/facebookresearch/InferSent
GLOVE_EMBS = '../dataset/GloVe/glove.840B.300d.txt'
INFERSENT_MODEL = 'infersent.allnli.pickle'

# Load the trained InferSent model
model = torch.load(INFERSENT_MODEL,
                   map_location=lambda storage, loc: storage)

model.set_glove_path(GLOVE_EMBS)
model.build_vocab_k_words(K=100000)

Now we need to define a similarity measure to compare two vectors, and for that goal, I'll use the cosine similarity since it is pretty straightforward:

$$cos(\pmb x, \pmb y) = \frac{\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}$$

As you can see, if we have two unit vectors (vectors with norm 1), the two terms in the denominator of the formula will be 1 and we will be able to remove the entire denominator, leaving only:

$$cos(\hat{x}, \hat{y}) = \hat{x} \cdot \hat{y}$$

So, if we normalize our vectors to have a unit norm (that's why the vectors are wearing hats in the equation above), we can make the computation of the cosine similarity become just a simple dot product. This will help us a lot later when computing the similarity distance, when we'll use a framework to do the secure computation of this dot product.
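Just to illustrate that simplification numerically with a toy example:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.5, 1.0])

# Full cosine similarity
cos_full = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Dot product of the normalized (unit norm) vectors
x_hat = x / np.linalg.norm(x)
y_hat = y / np.linalg.norm(y)
cos_unit = np.dot(x_hat, y_hat)

print(np.isclose(cos_full, cos_unit))  # True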

So, the next step is to define a function that will take some sentence text and forward it to the model to generate the embeddings, and then normalize them to unit vectors:

# This function will forward the text into the model and
# get the embeddings. After that, it will normalize it
# to a unit vector.
def encode(model, text):
    embedding = model.encode([text])[0]
    embedding /= np.linalg.norm(embedding)
    return embedding

As you can see, this function is pretty simple: it feeds the text into the model and then divides the embedding vector by the embedding norm.

Now, for practical reasons, I'll be using integer arithmetic later to compute the similarity; however, the embeddings generated by InferSent are of course real values. For that reason, you'll see in the code below that we create another function to scale the float values and remove the radix point, converting them to integers. There is also another important issue: the framework that we'll be using later for secure computation doesn't allow signed integers, so we also need to clip the embedding values between 0.0 and 1.0. This will of course cause some approximation errors; however, we can still get very good approximations after clipping and scaling with limited precision (I'm using 14 bits for scaling to avoid overflow issues later during dot product computations):

# This function will scale the embedding in order to
# remove the radix point.
def scale(embedding):
    SCALE = 1 << 14
    scale_embedding = np.clip(embedding, 0.0, 1.0) * SCALE
    return scale_embedding.astype(np.int32)

You can use floating-point in your secure computations and there are a lot of frameworks that support it; however, it is trickier to do, and for that reason, I used integer arithmetic to simplify the tutorial. The function above is just a hack to make it simple. It's easy to see that we can recover this embedding later without too much loss of precision.
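For completeness, the inverse of scale() is just a division by the same factor; this small illustrative helper isn't used in the rest of the tutorial, but it shows that the clipped embedding is recoverable with limited precision:

# Recover the (clipped) float embedding from the scaled integer representation
def unscale(scaled_embedding):
    SCALE = 1 << 14
    return scaled_embedding.astype(np.float32) / SCALE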

Now we just need to create some sentence samples that we’ll be using:

# Alice sentences
alice_sentences = [
    'my cat loves to walk over my keyboard',
    'I like to pet my cat',
]

# Bob sentences
bob_sentences = [
    'the cat is always walking over my keyboard',
]

and then convert them to embeddings:

# Alice sentences
alice_sentence1 = encode(model, alice_sentences[0])
alice_sentence2 = encode(model, alice_sentences[1])

# Bob sentences
bob_sentence1 = encode(model, bob_sentences[0])

Since we now have the sentence embeddings and each one is normalized, we can compute the cosine similarity just by doing a dot product between the vectors:

>>> np.dot(bob_sentence1, alice_sentence1)
0.8798542

>>> np.dot(bob_sentence1, alice_sentence2)
0.62976325

As we can see, Bob's first sentence is more similar to Alice's first sentence (~0.87) than to Alice's second sentence (~0.62).

Since we have now the embeddings, we just need to convert them to scaled integers:

# Scale the Alice sentence embeddings
alice_sentence1_scaled = scale(alice_sentence1)
alice_sentence2_scaled = scale(alice_sentence2)

# Scale the Bob sentence embeddings
bob_sentence1_scaled = scale(bob_sentence1)

# This is the unit vector embedding for the sentence
>>> alice_sentence1
array([ 0.01698913, -0.0014404 ,  0.0010993 , ...,  0.00252409,
        0.00828147,  0.00466533], dtype=float32)

# This is the scaled vector as integers
>>> alice_sentence1_scaled
array([278,   0,  18, ...,  41, 135,  76], dtype=int32)

Now, with these embeddings as scaled integers, we can proceed to the second part, where we'll do the secure computation between the two parties.

Two-party secure computation

In order to perform secure computation between the two parties (Alice and Bob), we'll use the ABY framework. ABY implements many different secure computation schemes and allows you to describe your computation as a circuit like the one pictured in the image below, where the Yao's Millionaires' problem is described:

Yao’s Millionaires problem. Taken from ABY documentation (https://github.com/encryptogroup/ABY).

As you can see, we have two inputs entering in one GT GATE (greater than gate) and then a output. This circuit has a bit length of 3 for each input and will compute if the Alice input is greater than (GT GATE) the Bob input. The computing parties then secret share their private data and then can use arithmetic sharing, boolean sharing, or Yao sharing to securely evaluate these gates.

ABY is pretty easy to use, because you can just describe your inputs, shares and gates and it will do the rest for you, such as creating the socket communication channel, exchanging data when needed, etc. However, the implementation is entirely written in C++ and I'm not aware of any Python bindings for it (a great contribution opportunity).

Fortunately, there is an implemented example for ABY that can do the dot product calculation for us. I won't replicate the example here, but the only parts that we have to change are to read the embedding vectors that we created before instead of generating random vectors, and to increase the bit length to 32 bits.

After that, we just need to execute the application on two different machines (or by emulating locally like below):

# This will execute the server part, the -r 0 specifies the role (server)
# and the -n 4096 defines the dimension of the vector (InferSent generates
# 4096-dimensional embeddings).
~# ./innerproduct -r 0 -n 4096

# And the same on another process (or another machine; however, for another
# machine execution you'll obviously have to specify the IP).
~# ./innerproduct -r 1 -n 4096

And we get the following results:

Inner product of alice_sentence1 and bob_sentence1 = 226691917
Inner product of alice_sentence2 and bob_sentence1 = 171746521

Even in the integer representation, you can see that the inner product of Alice's first sentence and Bob's sentence is higher, meaning that the similarity is also higher. But let's now convert this value back to float:

>>> SCALE = 1 << 14

# This is the dot product we should get
>>> np.dot(alice_sentence1, bob_sentence1)
0.8798542

# This is the inner product we got on the secure computation
>>> 226691917 / SCALE**2.0
0.8444931

# This is the dot product we should get
>>> np.dot(alice_sentence2, bob_sentence1)
0.6297632

# This is the inner product we got on the secure computation
>>> 171746521 / SCALE**2.0
0.6398056

As you can see, we got very good approximations, even in the presence of low-precision math and the unsigned-integer requirement. Of course, in real life you won't have both values and vectors, because they're supposed to be hidden, but the changes to accommodate that are trivial: you just need to adjust the ABY code to load only the vector of the party that is executing it and use the correct IP addresses/ports of both parties.

I hope you liked it!

– Christian S. Perone

Cite this article as: Christian S. Perone, "Privacy-preserving sentence semantic similarity using InferSent embeddings and secure two-party computation," inTerra Incognita,22/01/2018,//www.cpetem.com/2018/01/privacy-preserving-infersent/