Gandiva, using LLVM and Arrow to JIT and evaluate Pandas expressions

Introduction

This is the first post of 2020, so a happy new year to you all!

I have been a huge fan of LLVM since 11 years ago, when I started playing with it to JIT data structures such as AVLs, then later to JIT restricted AST trees and to JIT native code from TensorFlow graphs. Since then, LLVM has evolved into one of the most important compiler framework ecosystems and is used nowadays by a lot of important open-source projects.

One cool project that I recently became aware of is Gandiva. Gandiva was developed by Dremio and later donated to Apache Arrow (kudos to the Dremio team for that). The main idea of Gandiva is that it provides a compiler to generate LLVM IR that can operate on batches of Apache Arrow data. Gandiva was written in C++ and comes with a lot of different functions implemented to build expression trees that can be JIT'ed using LLVM. One nice property of this design is that it can use LLVM to automatically optimize complex expressions, add native target-platform vectorization such as AVX while operating on Arrow batches, and execute native code to evaluate the expressions.

The image below gives an overview of Gandiva:

Overview of how Gandiva works. Image from: https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow

In this post I'll build a very simple expression parser supporting a limited set of operations that I will use to filter a Pandas DataFrame.

Building a simple expression with Gandiva

In this section I'll show how to create a simple expression manually using the tree builder from Gandiva.

Using the Gandiva Python bindings to JIT an expression

Before building our parser and expression builder, let's manually build a simple expression with Gandiva. First, we will create a simple Pandas DataFrame with numbers from 0.0 to 9.0:

import pandas as pd
import pyarrow as pa
import pyarrow.gandiva as gandiva

# Create a simple Pandas DataFrame
df = pd.DataFrame({"x": [1.0 * i for i in range(10)]})
table = pa.Table.from_pandas(df)
schema = pa.Schema.from_pandas(df)

We converted the DataFrame to an Arrow Table; it is important to note that in this case it was a zero-copy operation, Arrow isn't copying data from Pandas and duplicating the DataFrame. Later we get the schema from the table, which contains the column types and other metadata.

After that, we want to use Gandiva to build the following expression to filter the data:

(x > 2.0) and (x < 6.0)
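Before wiring this through Gandiva, a pure-Python sketch (no Arrow involved) of what this predicate selects over our 0.0 to 9.0 column may help make the expected result concrete:

```python
# The column "x" from the DataFrame above, and the row indexes
# that satisfy the predicate (x > 2.0) and (x < 6.0).
x = [1.0 * i for i in range(10)]
indexes = [i for i, v in enumerate(x) if v > 2.0 and v < 6.0]
print(indexes)  # [3, 4, 5]
```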

This expression will be built using nodes from Gandiva:

builder = gandiva.TreeExprBuilder()

# Reference the column "x"
node_x = builder.make_field(table.schema.field("x"))

# Make two literals: 2.0 and 6.0
two = builder.make_literal(2.0, pa.float64())
six = builder.make_literal(6.0, pa.float64())

# Create a function for "x > 2.0"
gt_five_node = builder.make_function("greater_than",
                                     [node_x, two],
                                     pa.bool_())

# Create a function for "x < 6.0"
lt_ten_node = builder.make_function("less_than",
                                    [node_x, six],
                                    pa.bool_())

# Create an "and" node, for "(x > 2.0) and (x < 6.0)"
and_node = builder.make_and([gt_five_node, lt_ten_node])

# Make the expression a condition and create a filter
condition = builder.make_condition(and_node)
filter_ = gandiva.make_filter(table.schema, condition)

This code now looks a little more complex but it is easy to understand. We are basically creating the nodes of a tree that will represent the expression we showed earlier. Here is a graphical representation of what it looks like:

Inspecting the generated LLVM IR

Unfortunately, I haven't found a way to dump the generated LLVM IR using the Arrow Python bindings; however, we can just use the C++ API to build the same tree and then look at the generated LLVM IR:

auto field_x = field("x", float32());
auto schema = arrow::schema({field_x});

auto node_x = TreeExprBuilder::MakeField(field_x);

auto two = TreeExprBuilder::MakeLiteral((float_t) 2.0);
auto six = TreeExprBuilder::MakeLiteral((float_t) 6.0);

auto gt_five_node = TreeExprBuilder::MakeFunction("greater_than",
                                                  {node_x, two}, arrow::boolean());
auto lt_ten_node = TreeExprBuilder::MakeFunction("less_than",
                                                 {node_x, six}, arrow::boolean());

auto and_node = TreeExprBuilder::MakeAnd({gt_five_node, lt_ten_node});
auto condition = TreeExprBuilder::MakeCondition(and_node);

std::shared_ptr<Filter> filter;
auto status = Filter::Make(schema, condition, TestConfiguration(), &filter);

The code above is the same as the Python code, but using the C++ Gandiva API. Now that we built the tree in C++, we can get the LLVM Module and dump the IR code for it. The generated IR is full of boilerplate code and the JIT'ed functions from the Gandiva registry; however, the important parts are shown below:

; Function Attrs: alwaysinline norecurse nounwind readnone ssp uwtable
define internal zeroext i1 @less_than_float32_float32(float, float) local_unnamed_addr #0 {
  %3 = fcmp olt float %0, %1
  ret i1 %3
}

; Function Attrs: alwaysinline norecurse nounwind readnone ssp uwtable
define internal zeroext i1 @greater_than_float32_float32(float, float) local_unnamed_addr #0 {
  %3 = fcmp ogt float %0, %1
  ret i1 %3
}

(...)
  %x = load float, float* %11
  %greater_than_float32_float32 = call i1 @greater_than_float32_float32(float %x, float 2.000000e+00)
(...)
  %x11 = load float, float* %15
  %less_than_float32_float32 = call i1 @less_than_float32_float32(float %x11, float 6.000000e+00)

As you can see, in the IR we have calls to the functions less_than_float32_float32 and greater_than_float32_float32, which are the (in this case, very simple) Gandiva functions to do the float comparisons. Note the specialization of the functions by looking at the function name suffixes.

What is quite interesting is that LLVM will apply all of its optimizations over this code and will generate efficient native code for the target platform, while Gandiva and LLVM take care of making sure that memory alignment is correct so that extensions such as AVX can be used for vectorization.

The IR code I showed above isn't actually the one that is executed; the executed one is the optimized IR, and in the optimized IR we can see that LLVM inlined the functions, as shown in the part of the optimized code below:

%x.us = load float, float* %10, align 4
%11 = fcmp ogt float %x.us, 2.000000e+00
%12 = fcmp olt float %x.us, 6.000000e+00
%not.or.cond = and i1 %12, %11

You can see that the expression is now much simpler after optimization, as LLVM applied its powerful optimizations and inlined a lot of the Gandiva functions.

Building a Pandas filter expression JIT with Gandiva

Now we want to be able to implement something similar to the Pandas DataFrame.query() function using Gandiva. The first problem we face is that we need to parse a string such as (x > 2.0) and (x < 6.0); later we will have to build the Gandiva expression tree using the tree builder from Gandiva, and then evaluate that expression over the Arrow data.

Now, instead of implementing a full parsing of the expression string, I’ll use the Python AST module to parse valid Python code and build an Abstract Syntax Tree (AST) of that expression, that I’ll be later using to emit the Gandiva/LLVM nodes.

The heavy work of parsing the string will be delegated to Python AST module and our work will be mostly walking on this tree and emitting the Gandiva nodes based on that syntax tree. The code for visiting the nodes of this Python AST tree and emitting Gandiva nodes is shown below:

class LLVMGandivaVisitor(ast.NodeVisitor):
    def __init__(self, df_table):
        self.table = df_table
        self.builder = gandiva.TreeExprBuilder()
        self.columns = {f.name: self.builder.make_field(f)
                        for f in self.table.schema}
        self.compare_ops = {
            "Gt": "greater_than",
            "Lt": "less_than",
        }
        self.bin_ops = {
            "BitAnd": self.builder.make_and,
            "BitOr": self.builder.make_or,
        }

    def visit_Module(self, node):
        return self.visit(node.body[0])

    def visit_BinOp(self, node):
        left = self.visit(node.left)
        right = self.visit(node.right)
        op_name = node.op.__class__.__name__
        gandiva_bin_op = self.bin_ops[op_name]
        return gandiva_bin_op([left, right])

    def visit_Compare(self, node):
        op = node.ops[0]
        op_name = op.__class__.__name__
        gandiva_comp_op = self.compare_ops[op_name]
        comparators = self.visit(node.comparators[0])
        left = self.visit(node.left)
        return self.builder.make_function(gandiva_comp_op,
                                          [left, comparators], pa.bool_())

    def visit_Num(self, node):
        return self.builder.make_literal(node.n, pa.float64())

    def visit_Expr(self, node):
        return self.visit(node.value)

    def visit_Name(self, node):
        return self.columns[node.id]

    def generic_visit(self, node):
        return node

    def evaluate_filter(self, llvm_mod):
        condition = self.builder.make_condition(llvm_mod)
        filter_ = gandiva.make_filter(self.table.schema, condition)
        result = filter_.evaluate(self.table.to_batches()[0],
                                  pa.default_memory_pool())
        arr = result.to_array()
        pd_result = arr.to_numpy()
        return pd_result

    @staticmethod
    def gandiva_query(df, query):
        df_table = pa.Table.from_pandas(df)
        llvm_gandiva_visitor = LLVMGandivaVisitor(df_table)
        mod_f = ast.parse(query)
        llvm_mod = llvm_gandiva_visitor.visit(mod_f)
        results = llvm_gandiva_visitor.evaluate_filter(llvm_mod)
        return results

As you can see, the code is very simple as I'm not supporting every possible Python expression, but a minor subset of it. What we do in this class is basically a conversion of the Python AST nodes such as Comparisons and BinOps (binary operations) into Gandiva nodes. I'm also changing the semantics of the & and | operators to represent AND and OR respectively, such as in the Pandas query() function.
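To see why the visitor maps BitAnd/BitOr rather than And/Or, it helps to dump what the Python parser actually produces for a query using &:

```python
import ast

# Python parses "&" as a BinOp with a BitAnd operator, with the two
# comparisons as its operands; this is why the visitor dispatches on
# "BitAnd" instead of handling ast.And (which "and" would produce).
tree = ast.parse("(x > 2.0) & (x < 6.0)")
dump = ast.dump(tree.body[0].value)
print(dump)
```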

Register as a Pandas extension

The next step is to create a simple Pandas extension using the gandiva_query() method that we created:

@pd.api.extensions.register_dataframe_accessor("gandiva")
class GandivaAcessor:
    def __init__(self, pandas_obj):
        self.pandas_obj = pandas_obj

    def query(self, query):
        return LLVMGandivaVisitor.gandiva_query(self.pandas_obj, query)

And that is it; now we can use this extension to do things such as:

df = pd.DataFrame({"a": [1.0 * i for i in range(nsize)]})
results = df.gandiva.query("a > 10.0")

As you can see, we registered a Pandas extension named gandiva that is now a first-class citizen of Pandas DataFrames.

Let's now create a 50 million floats DataFrame and use the new query() method to filter it:

df = pd.DataFrame({"a": [1.0 * i for i in range(50000000)]})
df.gandiva.query("a < 4.0")

# This will output:
# array([0, 1, 2, 3], dtype=uint32)

Note that the returned values are the indexes satisfying the condition we implemented, so it is different from the Pandas query(), which returns the data already filtered.
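To make the difference concrete, here is a small pure-Python sketch (using a 10-row column instead of the 50-million one) of how the returned indexes could be mapped back to the filtered values that Pandas query() would return:

```python
# Column data, and the row indexes a filter like "a < 4.0" selects.
data = [1.0 * i for i in range(10)]
indexes = [0, 1, 2, 3]                  # what the Gandiva filter returns
filtered = [data[i] for i in indexes]   # what Pandas query() hands back
print(filtered)  # [0.0, 1.0, 2.0, 3.0]
```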

I did some benchmarks and found that Gandiva is usually always faster than Pandas; however, I'll leave proper benchmarks for a next post on Gandiva, as this post was meant to show how you can use it to JIT expressions.

That's it! I hope you liked this post as much as I enjoyed exploring Gandiva. It seems that we will probably have more and more tools coming up with Gandiva acceleration, especially for SQL parsing/projection/JIT'ing. Gandiva is much more than what I just showed, but you can get started now to understand more of its architecture and how to build the expression trees.

– Christian S. Perone

Cite this article as: Christian S. Perone, "Gandiva, using LLVM and Arrow to JIT and evaluate Pandas expressions," in Terra Incognita, 19/01/2020, //www.cpetem.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/

JIT生成本机代码使用Python和LLVM TensorFlow计算图

Update: Hacker News discussion here

The TensorFlow computation graph


One of the most amazing components of the TensorFlow architecture is the computation graph, which can be serialized using Protocol Buffers. This computation graph follows a well-defined format (click here for the proto source) and describes the computation that you specify (it can be a Deep Learning model like a CNN, a simple Logistic Regression or even any computation you want). For instance, here is an example of a very simple TensorFlow computation graph that we will use in this tutorial (using the TensorFlow Python API):

import tensorflow as tf

with tf.Session() as sess:
    input_placeholder = tf.placeholder(tf.int32, 1, name="input")
    sub_op = tf.sub(input_placeholder, tf.constant(2, dtype=tf.int32))
    add_op = tf.add(sub_op, tf.constant(5, dtype=tf.int32))
    output = tf.add(add_op, tf.constant(100, dtype=tf.int32),
                    name="output")
    tf.train.write_graph(sess.graph_def, "", "graph.pb", True)
Representation of the computation graph.

As you can see, this is a very simple computation graph. First, we define the placeholder that will hold the input tensor, and after that we specify the computation that should happen using this input tensor as input data. Here we can also see that we're defining two important nodes of this graph: one is called "input" (the aforementioned placeholder) and the other is called "output", which will hold the result of the final computation. This graph is the same as the following formula for a scalar: output = ((input - 2) + 5) + 100, where I intentionally added redundant operations to see the LLVM constant propagation later.
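The scalar formula can be checked directly in plain Python, which also gives us the value to expect later when we execute the JIT'ed function with the input 10:

```python
def graph_formula(inp):
    # output = ((input - 2) + 5) + 100, the scalar form of the graph
    return ((inp - 2) + 5) + 100

print(graph_formula(10))  # 113
```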

In the last line of the code, we persist this computation graph (including the constant values) into a serialized protobuf file. The final True parameter is to output a textual representation instead of binary, so it will produce the following human-readable output protobuf file (I omitted a part of it for brevity):

node {
  name: "input"
  op: "Placeholder"
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "shape"
    value {
      shape {
        dim {
          size: 1
        }
      }
    }
  }
}
node {
  name: "Const"
  op: "Const"
  attr {
    key: "dtype"
    value {
      type: DT_INT32
    }
  }
  attr {
    key: "value"
    value {
      tensor {
        dtype: DT_INT32
        tensor_shape {
        }
        int_val: 2
      }
    }
  }
}

--- > (omitted for brevity) < ---

node {
  name: "output"
  op: "Add"
  input: "Add"
  input: "Const_2"
  attr {
    key: "T"
    value {
      type: DT_INT32
    }
  }
}
versions {
  producer: 9
}

This is a very simple graph; TensorFlow graphs are actually never that simple, because TensorFlow models can easily contain more than 300 nodes depending on the model you specify, especially for Deep Learning models.

We'll use the graph above to show how we can JIT native code for this simple graph using the LLVM framework.

The LLVM Frontend, IR and Backend


The LLVM framework is a really nice, modular and complete ecosystem for building compilers and toolchains. A very nice description of the LLVM architecture that is important for us is shown in the picture below:

The LLVM compiler architecture (AOSA/LLVM, Chris Lattner)

(The picture above is just a small part of the LLVM architecture; for a comprehensive description of it, please see the nice article from the AOSA book written by Chris Lattner.)

Looking at the image above, we can see that LLVM provides a lot of core functionality. On the left side you see that many languages can write code for their respective language frontends; after that, it doesn't matter in which language you wrote your code, everything is transformed into a very powerful language called LLVM IR (LLVM Intermediate Representation), which is, as you can imagine, an intermediate representation of the code just before the assembly code itself. In my opinion, the IR is the key component of what makes LLVM so amazing, because it doesn't matter in which language you wrote your code (or even if it was a JIT'ed IR), everything ends in the same representation, and then here is where the magic happens, because the IR can take advantage of the LLVM optimizations (also known as transformation and analysis passes).

After the IR is generated, it can be fed into any LLVM backend to generate native code for any architecture supported by LLVM (such as x86, ARM, PPC, etc.), and after that you can finally execute your code with native performance, also after the LLVM optimization passes.

In order to JIT code using LLVM, all you need is to build the IR programmatically, create an execution engine to convert (at execution time) the IR into native code, get a pointer to the function you have JIT'ed, and then finally execute it. I'll use here a Python binding for LLVM called llvmlite, which is very Pythonic and easy to use.

JIT’ing TensorFlow Graph using Python and LLVM


Now, let's use LLVM and Python to JIT the TensorFlow computation graph. This is by no means a comprehensive implementation; it is a very simplistic approach, an oversimplification that assumes some things: an integer closure type, just some TensorFlow operations and also single scalar support instead of high-rank tensors.

So, let’s start building our JIT code; first of all, let’s import the required packages, initialize some LLVM sub-systems and also define the LLVM respective type for the TensorFlow integer type:

from ctypes import CFUNCTYPE, c_int

import tensorflow as tf

from tensorflow.core.framework import graph_pb2
from google.protobuf import text_format
from tensorflow.core.framework import types_pb2
from tensorflow.python.framework import ops

import llvmlite.ir as ll
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

TYPE_TF_LLVM = {
    types_pb2.DT_INT32: ll.IntType(32),
}

After that, let's define a class to open the TensorFlow exported graph, and also declare a method to get a node of the graph by name:

class TFGraph(object):
    def __init__(self, filename="graph.pb", binary=False):
        self.graph_def = graph_pb2.GraphDef()
        with open("graph.pb", "rb") as f:
            if binary:
                self.graph_def.ParseFromString(f.read())
            else:
                text_format.Merge(f.read(), self.graph_def)

    def get_node(self, name):
        for node in self.graph_def.node:
            if node.name == name:
                return node

And let's start defining our main function, which will be the starting point of the code:

def run_main():
    graph = TFGraph("graph.pb", False)

    input_node = graph.get_node("input")
    output_node = graph.get_node("output")

    input_type = TYPE_TF_LLVM[input_node.attr["dtype"].type]
    output_type = TYPE_TF_LLVM[output_node.attr["T"].type]

    module = ll.Module()
    func_type = ll.FunctionType(output_type, [input_type])
    func = ll.Function(module, func_type, name='tensorflow_graph')
    func.args[0].name = 'input'

    bb_entry = func.append_basic_block('entry')
    ir_builder = ll.IRBuilder(bb_entry)

As you can see in the code above, we open the serialized protobuf graph and then get the input and output nodes of this graph. After that we also map the types of the two graph nodes (input/output) to the LLVM types (from TensorFlow integer to LLVM integer). We then start by defining an LLVM Module, which is the top-level container for all IR objects. One module in LLVM can contain many different functions; here we will just create one function that will represent the graph. This function will receive as input argument the input data of the same type as the input node, and then it will return a value with the same type as the output node.

After that, we start by creating the entry block of the function, and using this block we instantiate our IR Builder, which is the object that will provide us the building blocks for JIT'ing the operations of the TensorFlow graph.

Let's now define the function that will do the real work of converting TensorFlow nodes into LLVM IR:

def build_graph(ir_builder, graph, node):
    if node.op == "Add":
        left_op_node = graph.get_node(node.input[0])
        right_op_node = graph.get_node(node.input[1])
        left_op = build_graph(ir_builder, graph, left_op_node)
        right_op = build_graph(ir_builder, graph, right_op_node)
        return ir_builder.add(left_op, right_op)

    if node.op == "Sub":
        left_op_node = graph.get_node(node.input[0])
        right_op_node = graph.get_node(node.input[1])
        left_op = build_graph(ir_builder, graph, left_op_node)
        right_op = build_graph(ir_builder, graph, right_op_node)
        return ir_builder.sub(left_op, right_op)

    if node.op == "Placeholder":
        function_args = ir_builder.function.args
        for arg in function_args:
            if arg.name == node.name:
                return arg
        raise RuntimeError("Input [{}] not found !".format(node.name))

    if node.op == "Const":
        llvm_const_type = TYPE_TF_LLVM[node.attr["dtype"].type]
        const_value = node.attr["value"].tensor.int_val[0]
        llvm_const_value = llvm_const_type(const_value)
        return llvm_const_value

In this function, we receive as parameters the IR Builder, the graph class that we created earlier and the output node. This function will then recursively build the LLVM IR by means of the IR Builder. Here you can see that I only implemented the Add/Sub/Placeholder and Const operations from the TensorFlow graph, just to be able to support the graph we defined earlier.
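The recursion in build_graph() is easier to see in a pure-Python analogue: the sketch below evaluates a tiny dict-based graph (a made-up structure, just for illustration) with the same four operations, walking from the output node down to the placeholders and constants:

```python
# A hypothetical dict-based graph mirroring the TensorFlow one:
# output = ((input - 2) + 5) + 100
graph = {
    "input":  {"op": "Placeholder"},
    "c2":     {"op": "Const", "value": 2},
    "c5":     {"op": "Const", "value": 5},
    "c100":   {"op": "Const", "value": 100},
    "sub":    {"op": "Sub", "inputs": ["input", "c2"]},
    "add":    {"op": "Add", "inputs": ["sub", "c5"]},
    "output": {"op": "Add", "inputs": ["add", "c100"]},
}

def evaluate(name, feed):
    # Recurse from a node down to placeholders/constants,
    # just like build_graph() recurses while emitting IR.
    node = graph[name]
    if node["op"] == "Placeholder":
        return feed[name]
    if node["op"] == "Const":
        return node["value"]
    lhs = evaluate(node["inputs"][0], feed)
    rhs = evaluate(node["inputs"][1], feed)
    return lhs + rhs if node["op"] == "Add" else lhs - rhs

print(evaluate("output", {"input": 10}))  # 113
```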

After that, we just need to define a function that will take an LLVM Module and then create an execution engine that will execute the LLVM optimizations over the LLVM IR before doing the hard work of converting the IR into native x86 code:

def create_engine(module):
    features = llvm.get_host_cpu_features().flatten()
    llvm_module = llvm.parse_assembly(str(module))
    target = llvm.Target.from_default_triple()
    target_machine = target.create_target_machine(opt=3, features=features)
    engine = llvm.create_mcjit_compiler(llvm_module, target_machine)
    engine.finalize_object()
    print target_machine.emit_assembly(llvm_module)
    return engine

In the code above, you can see that we first get the CPU features (SSE, etc.) into a list; after that we parse the LLVM IR from the module and then we create an engine using the maximum optimization level (opt=3, roughly equivalent to the GCC -O3 parameter). We're also printing the assembly code (in my case, the x86 assembly built by LLVM).

And here we just finish our run_main() function:

    ret = build_graph(ir_builder, graph, output_node)
    ir_builder.ret(ret)

    with open("output.ir", "w") as f:
        f.write(str(module))

    engine = create_engine(module)

    func_ptr = engine.get_function_address("tensorflow_graph")
    cfunc = CFUNCTYPE(c_int, c_int)(func_ptr)
    ret = cfunc(10)

    print "Execution output: {}".format(ret)

As you can see in the code above, we just call the build_graph() method and then use the IR Builder to add the "ret" LLVM IR instruction (ret = return) to return the output of the IR function we just created based on the TensorFlow graph. We're also writing the IR output to an external file; I'll later use this LLVM IR file to create native assembly for other different architectures such as the ARM architecture. And finally, we just get the native code function address, create a Python wrapper for this function and then call it with the argument "10", which will be the input data, and then output the resulting output value.
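The CFUNCTYPE prototype used above can be tried out without any JIT at all: ctypes also accepts a plain Python callable, so we can sketch the wrapping and the expected call result (here with the graph's scalar formula standing in for the JIT'ed function address):

```python
from ctypes import CFUNCTYPE, c_int

# CFUNCTYPE(return_type, arg_types...): here we wrap a Python
# callable instead of a native function pointer, just to show how
# the prototype from the post is invoked.
proto = CFUNCTYPE(c_int, c_int)
cfunc = proto(lambda x: ((x - 2) + 5) + 100)
print(cfunc(10))  # 113
```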

And that is it; of course, this is just an oversimplification, but now we understand the advantages of having a JIT for our TensorFlow models.

The LLVM IR output, the optimizations and the multiple-architectures advantage (ARM, PPC, x86, etc.)

For instance, let's create the LLVM IR (using the code shown above) of the following TensorFlow graph:

import tensorflow as tf

with tf.Session() as sess:
    input_placeholder = tf.placeholder(tf.int32, 1, name="input")
    sub_op = tf.sub(input_placeholder, tf.constant(2, dtype=tf.int32))
    add_op = tf.add(sub_op, tf.constant(5, dtype=tf.int32))
    output = tf.add(add_op, tf.constant(100, dtype=tf.int32),
                    name="output")
    tf.train.write_graph(sess.graph_def, "", "graph.pb", True)

The generated LLVM IR is the one below:

; ModuleID = ""
target triple = "unknown-unknown-unknown"
target datalayout = ""

define i32 @"tensorflow_graph"(i32 %"input")
{
entry:
  %".3" = sub i32 %"input", 2
  %".4" = add i32 %".3", 5
  %".5" = add i32 %".4", 100
  ret i32 %".5"
}

As you can see, the LLVM IR looks a lot like assembly code, but this is not the final assembly code; this is just a non-optimized IR. Just before generating the x86 assembly code, LLVM runs a lot of optimization passes over the LLVM IR, doing things such as dead code elimination, constant propagation, etc. And here is the final native x86 assembly code that LLVM generates for the above LLVM IR of the TensorFlow graph:

    .text
    .file   "<string>"
    .globl  tensorflow_graph
    .align  16, 0x90
    .type   tensorflow_graph,@function
tensorflow_graph:
    .cfi_startproc
    leal    103(%rdi), %eax
    retq
.Lfunc_end0:
    .size   tensorflow_graph, .Lfunc_end0-tensorflow_graph
    .cfi_endproc

    .section    ".note.GNU-stack","",@progbits

As you can see, the optimized code removed a lot of redundant operations and ended up just doing an add operation of 103, which is the correct simplification of the computation that we defined in the graph. For large graphs, you can see that these optimizations can be really powerful, because we are reusing the compiler optimizations developed over many years in our Machine Learning model computation.
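The simplification LLVM found can be verified by hand: the chain of constants in the graph collapses into a single +103, exactly what the leal instruction above computes:

```python
# Constant folding done manually: (-2) + 5 + 100 == 103, so the
# whole graph reduces to input + 103 for any input value.
assert (-2) + 5 + 100 == 103
for inp in (0, 10, -7):
    assert ((inp - 2) + 5) + 100 == inp + 103
print("folded to input + 103")
```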

You can also use an LLVM tool called "llc", which can take an LLVM IR file and generate assembly for any other platform you need; for instance, the command-line below will generate native code for the ARM architecture:

llc -O3 out.ll -march=arm -o sample.s

The output sample.s file is the one below:

    .text
    .syntax unified
    .eabi_attribute 67, "2.09"  @ Tag_conformance
    .eabi_attribute 6, 1        @ Tag_CPU_arch
    .eabi_attribute 8, 1        @ Tag_ARM_ISA_use
    .eabi_attribute 17, 1       @ Tag_ABI_PCS_GOT_use
    .eabi_attribute 20, 1       @ Tag_ABI_FP_denormal
    .eabi_attribute 21, 1       @ Tag_ABI_FP_exceptions
    .eabi_attribute 23, 3       @ Tag_ABI_FP_number_model
    .eabi_attribute 34, 1       @ Tag_CPU_unaligned_access
    .eabi_attribute 24, 1       @ Tag_ABI_align_needed
    .eabi_attribute 25, 1       @ Tag_ABI_align_preserved
    .eabi_attribute 38, 1       @ Tag_ABI_FP_16bit_format
    .eabi_attribute 14, 0       @ Tag_ABI_PCS_R9_use
    .file   "out.ll"
    .globl  tensorflow_graph
    .align  2
    .type   tensorflow_graph,%function
tensorflow_graph:                       @ @tensorflow_graph
    .fnstart
@ BB#0:                                 @ %entry
    add r0, r0, #103
    mov pc, lr
.Lfunc_end0:
    .size   tensorflow_graph, .Lfunc_end0-tensorflow_graph
    .fnend

    .section    ".note.GNU-stack","",%progbits

As you can see above, the ARM assembly code is also just an "add" assembly instruction followed by a return instruction.

This is really nice because we can take natural advantage of the LLVM framework. For instance, today ARM just announced the ARMv8-A with Scalable Vector Extensions (SVE) that will support 2048-bit vectors, and they are already working on patches for LLVM. In the future, a really nice addition to LLVM would be the development of LLVM passes for analysis and transformation that would take into consideration the nature of Machine Learning models.

And that's it, I hope you liked the post! It is really awesome what you can do with a few lines of Python, LLVM and TensorFlow.

Update 22 Aug 2016: Josh Klontz just pointed to his cool project called Likely on the Hacker News discussion.

Update 22 Aug 2016: the TensorFlow team is actually working on a JIT (I don't know whether they are using LLVM, but it seems to me the most reasonable way to go). In their paper, there is also a very important statement regarding Future Work that I cite here:

"We also have a number of concrete directions to improve the performance of TensorFlow. One such direction is our initial work on a just-in-time compiler that can take a subgraph of a TensorFlow execution, perhaps with some runtime profiling information about the typical sizes and shapes of tensors, and can generate an optimized routine for this subgraph. This compiler will understand the semantics to perform a number of optimizations such as loop fusion, blocking and tiling for locality, specialization for particular shapes and sizes, etc." – TensorFlow White Paper

Full code

from ctypes import CFUNCTYPE, c_int

import tensorflow as tf

from tensorflow.core.framework import graph_pb2
from google.protobuf import text_format
from tensorflow.core.framework import types_pb2
from tensorflow.python.framework import ops

import llvmlite.ir as ll
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

TYPE_TF_LLVM = {
    types_pb2.DT_INT32: ll.IntType(32),
}


class TFGraph(object):
    def __init__(self, filename="graph.pb", binary=False):
        self.graph_def = graph_pb2.GraphDef()
        with open("graph.pb", "rb") as f:
            if binary:
                self.graph_def.ParseFromString(f.read())
            else:
                text_format.Merge(f.read(), self.graph_def)

    def get_node(self, name):
        for node in self.graph_def.node:
            if node.name == name:
                return node


def build_graph(ir_builder, graph, node):
    if node.op == "Add":
        left_op_node = graph.get_node(node.input[0])
        right_op_node = graph.get_node(node.input[1])
        left_op = build_graph(ir_builder, graph, left_op_node)
        right_op = build_graph(ir_builder, graph, right_op_node)
        return ir_builder.add(left_op, right_op)

    if node.op == "Sub":
        left_op_node = graph.get_node(node.input[0])
        right_op_node = graph.get_node(node.input[1])
        left_op = build_graph(ir_builder, graph, left_op_node)
        right_op = build_graph(ir_builder, graph, right_op_node)
        return ir_builder.sub(left_op, right_op)

    if node.op == "Placeholder":
        function_args = ir_builder.function.args
        for arg in function_args:
            if arg.name == node.name:
                return arg
        raise RuntimeError("Input [{}] not found !".format(node.name))

    if node.op == "Const":
        llvm_const_type = TYPE_TF_LLVM[node.attr["dtype"].type]
        const_value = node.attr["value"].tensor.int_val[0]
        llvm_const_value = llvm_const_type(const_value)
        return llvm_const_value


def create_engine(module):
    features = llvm.get_host_cpu_features().flatten()
    llvm_module = llvm.parse_assembly(str(module))
    target = llvm.Target.from_default_triple()
    target_machine = target.create_target_machine(opt=3, features=features)
    engine = llvm.create_mcjit_compiler(llvm_module, target_machine)
    engine.finalize_object()
    print target_machine.emit_assembly(llvm_module)
    return engine


def run_main():
    graph = TFGraph("graph.pb", False)

    input_node = graph.get_node("input")
    output_node = graph.get_node("output")

    input_type = TYPE_TF_LLVM[input_node.attr["dtype"].type]
    output_type = TYPE_TF_LLVM[output_node.attr["T"].type]

    module = ll.Module()
    func_type = ll.FunctionType(output_type, [input_type])
    func = ll.Function(module, func_type, name='tensorflow_graph')
    func.args[0].name = 'input'

    bb_entry = func.append_basic_block('entry')
    ir_builder = ll.IRBuilder(bb_entry)

    ret = build_graph(ir_builder, graph, output_node)
    ir_builder.ret(ret)

    with open("output.ir", "w") as f:
        f.write(str(module))

    engine = create_engine(module)

    func_ptr = engine.get_function_address("tensorflow_graph")
    cfunc = CFUNCTYPE(c_int, c_int)(func_ptr)
    ret = cfunc(10)

    print "Execution output: {}".format(ret)

if __name__ == "__main__":
    run_main()
Cite this article as: Christian S. Perone, "JIT native code generation for TensorFlow computation graphs using Python and LLVM," in Terra Incognita, 22/08/2016, //www.cpetem.com/2016/08/jit-native-code-generation-for-tensorflow-computation-graphs-using-python-and-llvm/

遗传编程和Python的限制AST表达式LLVM JIT

A small intro on the rationale

So I'm working on a Symbolic Regression Machine written in C/C++ called Shine, which is intended to be a JIT for Genetic Programming libraries (like Pyevolve, for instance). The main rationale behind Shine is that we have today a lot of research on speeding up Genetic Programming using GPUs (the GPU fever!) or any other special hardware, etc.; however, we don't have many papers talking about optimizing GP using state-of-the-art compiler optimizations like we have on clang, gcc, etc.

The "hot spot" or the component that consumes a lot of CPU resources today in Genetic Programming is the evaluation of each individual in order to calculate the fitness of the program tree. This evaluation is often executed on each set of parameters of the "training" set. Suppose you want to do a symbolic regression of an expression like the Pythagorean Theorem and you have a linear space of parameters from 1.0 to 1000.0 with a step of 0.1: you have about 10,000 evaluations for every individual (program tree) of your population!

What Shine does is described in the image below:

It takes the individual of the Genetic Programming engine and then converts it to the LLVM Intermediate Representation (the LLVM assembly language); after that it runs the LLVM transformation passes (here is where the real power of modern compilers enters in the GP context), and then the LLVM JIT converts the optimized LLVM IR into native code for the specified target (x86, PowerPC, etc.).

You can see the overall architecture of Shine below:

This architecture brings a lot of flexibility for Genetic Programming: you can, for instance, write functions that could later be used by your individuals in any language supported by LLVM. What matters to Shine is the LLVM IR, so you can use any language that LLVM supports and then use the IR generated by LLVM; you can mix code from C, C++, Ada, Fortran, D, etc. and use your functions as non-terminal nodes of your Genetic Programming trees.

Shine is still in its early development; it looks like a simple idea, but I still have a lot of problems to solve, things like how to JIT the evaluation process itself instead of doing calls from Python using ctypes bindings of the JIT'ed trees.

Genetic Programming on the Python AST itself

During the development of Shine, an idea happened to me: I could use a restricted Python Abstract Syntax Tree (AST) as the representation of individuals in a Genetic Programming engine. The main advantage of this is the flexibility and the possibility of reusing a lot of things. Of course a shared library written in C/C++ would be useful for a lot of Genetic Programming engines that don't use Python, but since my spare time to work on this is becoming more and more rare, I started to rethink the approach and to use Python and the Python bindings for LLVM (LLVMPY); I then found that it was pretty easy to JIT a restricted set of the Python AST to native code using LLVM, and that is what this post is going to show.

JIT'ing a restricted Python AST

The most amazing parts of LLVM are obviously the amount of transformation passes, the JIT and, of course, the ability to use the entire framework through a simple API (ok, not so simple sometimes). To simplify this example, I'll use an arbitrarily restricted subset of the Python AST that only supports subtraction (-), addition (+), multiplication (*) and division (/).
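As a small illustration of what "restricted" could mean in practice, here is a sketch of a checker (a hypothetical helper, not part of the post's JIT itself) that walks a parsed expression and rejects anything outside this operator set:

```python
import ast

# The operators we accept: Sub (-), Add (+), Mult (*) and Div (/)
ALLOWED_OPS = (ast.Add, ast.Sub, ast.Mult, ast.Div)

def is_restricted(expr):
    """Return True when the expression only uses the restricted set."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        # Reject binary operators outside the allowed set (e.g. **)
        if isinstance(node, ast.BinOp) and not isinstance(node.op, ALLOWED_OPS):
            return False
        # Reject function calls and attribute access altogether
        if isinstance(node, (ast.Call, ast.Attribute)):
            return False
    return True

print(is_restricted("(2 + 3*b) / 2"))  # True
print(is_restricted("2 ** 3"))         # False
```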

To understand the Python AST, you can use the Python parser that converts source into an AST:

>>> import ast
>>> astp = ast.parse("2*7")
>>> ast.dump(astp)
"Module(body=[Expr(value=BinOp(left=Num(n=2), op=Mult(), right=Num(n=7)))])"

What the parsing created was an Abstract Syntax Tree containing a BinOp (Binary Operation) with the left operand as the number 2, the right operand as the number 7 and the operation itself as a Multiplication (Mult) — very easy to understand. What we are going to do to create the LLVM IR is to create a visitor that will visit each node of the tree. To do that, we can subclass the Python NodeVisitor class from the ast module. What the NodeVisitor does is visit each node of the tree and then call the method 'visit_OPERATOR' if it exists; when the NodeVisitor is going to visit the node for the BinOp, for example, it will call the method 'visit_BinOp', passing as parameter the BinOp node itself.
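The dispatch mechanism can be seen with a tiny NodeVisitor of our own (a toy example, separate from the JIT visitor built next): it only defines visit_BinOp and counts how many binary operations the parser produced.

```python
import ast

class OpCounter(ast.NodeVisitor):
    """Counts BinOp nodes, just to illustrate the visitor dispatch."""
    def __init__(self):
        self.count = 0

    def visit_BinOp(self, node):
        self.count += 1
        # Keep walking into the children of this BinOp
        self.generic_visit(node)

counter = OpCounter()
counter.visit(ast.parse("2*7 + 3"))
print(counter.count)  # 2: one Add and one Mult
```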

The structure of the class of our JIT visitor will look like the code below:

# Import the ast and the LLVM Python bindings
import ast
from llvm.core import *
from llvm.ee import *
import llvm.passes as lp

class AstJit(ast.NodeVisitor):
    def __init__(self):
        pass

What we need to do now is to create an initialization method to keep the last state of the JIT visitor; this is needed because we are going to JIT the content of the Python AST into a function, and the last instruction of the function needs to return the result of the last instruction visited by the JIT. We also need to receive an LLVM Module object in which our function will be created, as well as the closure type; for the sake of simplicity I'm not typing any object, I'm just assuming that all numbers from the expression are integers, so the closure type will be the LLVM integer type.

def __init__(self, module, parameters):
    self.last_state = None
    self.module = module
    # Parameters that will be created on the IR function
    self.parameters = parameters
    self.closure_type = Type.int()
    # An attribute to hold a link to the created function
    # so we can use it to JIT later
    self.func_obj = None
    self._create_builder()

def _create_builder(self):
    # How many parameters of integer type
    params = [self.closure_type] * len(self.parameters)

    # The prototype of the function, returning a integer
    # and receiving the integer parameters
    ty_func = Type.function(self.closure_type, params)

    # Add the function to the module with the name 'func_ast_jit'
    self.func_obj = self.module.add_function(ty_func, 'func_ast_jit')

    # Create an argument in the function for each parameter specified
    for index, pname in enumerate(self.parameters):
        self.func_obj.args[index].name = pname

    # Create a basic block and the builder
    bb = self.func_obj.append_basic_block("entry")
    self.builder = Builder.new(bb)

Now what we need to implement on our visitor are the 'visit_OPERATOR' methods for the BinOp and for the Name operators. We will also implement the method that creates the return instruction returning the last state.

# A 'Name' is produced in the AST when you access a variable,
# like '2+x+y'; 'x' and 'y' are the two Name nodes created
# on the AST for the expression.
def visit_Name(self, node):
    # Which function argument is this variable ?
    index = self.parameters.index(node.id)
    self.last_state = self.func_obj.args[index]
    return self.last_state

# Here we create a LLVM IR integer constant using the
# Num node; on the expression '2+3' you'll have two
# Num nodes, the Num(n=2) and the Num(n=3).
def visit_Num(self, node):
    self.last_state = Constant.int(self.closure_type, node.n)
    return self.last_state

# The visitor for the binary operation
def visit_BinOp(self, node):
    # Get the operation and the left and right arguments
    lhs = self.visit(node.left)
    rhs = self.visit(node.right)
    op = node.op

    # Convert each operation (Sub, Add, Mult, Div) to its
    # LLVM IR integer instruction equivalent
    if isinstance(op, ast.Sub):
        op = self.builder.sub(lhs, rhs, 'sub_t')
    elif isinstance(op, ast.Add):
        op = self.builder.add(lhs, rhs, 'add_t')
    elif isinstance(op, ast.Mult):
        op = self.builder.mul(lhs, rhs, 'mul_t')
    elif isinstance(op, ast.Div):
        op = self.builder.sdiv(lhs, rhs, 'sdiv_t')

    self.last_state = op
    return self.last_state

# Build the return (ret) statement with the last state
def build_return(self):
    self.builder.ret(self.last_state)

And that is it: our visitor is ready to convert a Python AST to LLVM IR assembly language. To run it, we'll first create an LLVM module and an expression:

module = Module.new('ast_jit_module')
# Note that I'm using two variables 'a' and 'b'
expr = "(2+3*b+33*(10/2)+1+3/3+a)/2"
node = ast.parse(expr)
print ast.dump(node)

Will output:

Module(body=[Expr(value=BinOp(left=BinOp(left=BinOp(left=BinOp(left=BinOp(left=BinOp(left=Num(n=2), op=Add(), right=BinOp(left=Num(n=3), op=Mult(), right=Name(id='b', ctx=Load()))), op=Add(), right=BinOp(left=Num(n=33), op=Mult(), right=BinOp(left=Num(n=10), op=Div(), right=Num(n=2)))), op=Add(), right=Num(n=1)), op=Add(), right=BinOp(left=Num(n=3), op=Div(), right=Num(n=3))), op=Add(), right=Name(id='a', ctx=Load())), op=Div(), right=Num(n=2)))])

Now we can finally run our visitor on the generated AST and check the LLVM IR output:

visitor = AstJit(module, ['a', 'b'])
visitor.visit(node)
visitor.build_return()
print module

Will output the LLVM IR:

; ModuleID = 'ast_jit_module'

define i32 @func_ast_jit(i32 %a, i32 %b) {
entry:
  %mul_t = mul i32 3, %b
  %add_t = add i32 2, %mul_t
  %add_t1 = add i32 %add_t, 165
  %add_t2 = add i32 %add_t1, 1
  %add_t3 = add i32 %add_t2, 1
  %add_t4 = add i32 %add_t3, %a
  %sdiv_t = sdiv i32 %add_t4, 2
  ret i32 %sdiv_t
}

Now is when the real fun begins: we want to run the LLVM optimization passes to optimize our code with an equivalent of the GCC -O2 optimization level. To do that, we create a PassManagerBuilder and a PassManager; the PassManagerBuilder is the component that adds the passes to the PassManager, and you can also manually add arbitrary transformations like dead code elimination, function inlining, etc.:

pmb = lp.PassManagerBuilder.new()
# Optimization level
pmb.opt_level = 2

pm = lp.PassManager.new()
pmb.populate(pm)

# Run the passes into the module
pm.run(module)
print module

Will output:

; ModuleID = 'ast_jit_module'

define i32 @func_ast_jit(i32 %a, i32 %b) nounwind readnone {
entry:
  %mul_t = mul i32 %b, 3
  %add_t3 = add i32 %a, 169
  %add_t4 = add i32 %add_t3, %mul_t
  %sdiv_t = sdiv i32 %add_t4, 2
  ret i32 %sdiv_t
}

And here we have the optimized LLVM IR of the Python AST expression. The next step is to JIT this IR into native code and then execute it with some parameters:

ee = ExecutionEngine.new(module)
arg_a = GenericValue.int(Type.int(), 100)
arg_b = GenericValue.int(Type.int(), 42)

retval = ee.run_function(visitor.func_obj, [arg_a, arg_b])
print "Return: %d" % retval.as_int()

Will output:

Return: 197
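The value can be double-checked in plain Python, using floor division to mirror the integer semantics of the sdiv instruction in the JIT'ed function:

```python
a, b = 100, 42
# Same expression as the JIT'ed one; // stands in for the integer
# division that the LLVM sdiv instruction performs.
result = (2 + 3 * b + 33 * (10 // 2) + 1 + 3 // 3 + a) // 2
print(result)  # 197
```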

And that’s it, you have created a AST->LLVM IR converter, optimized the LLVM IR with the transformation passes and then converted it to native code using the LLVM execution engine. I hope you liked =)

Cite this article as: Christian S. Perone, "Genetic Programming and a LLVM JIT for restricted Python AST expressions," in Terra Incognita, 15/08/2012, //www.cpetem.com/2012/08/genetic-programming-and-a-llvm-jit-for-restricted-python-ast-expressions/

A method for JIT’ing algorithms and data structures with LLVM


Hello folks, I always post about Python and EvoComp (Pyevolve), but this time it is about C, LLVM, search algorithms and data structures. This post describes the efforts to implement an idea: to JIT (verb) algorithms and the data structures used by them, together.

AVL tree intro

Here is a short introduction to AVL Trees from Wikipedia:

In computer science, an AVL tree is a self-balancing binary search tree, and it was the first such data structure to be invented. In an AVL tree, the heights of the two child subtrees of any node differ by at most one; therefore, it is also said to be height-balanced. Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where n is the number of nodes in the tree prior to the operation. Insertions and deletions may require the tree to be rebalanced by one or more tree rotations.
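The height-balance invariant from the definition above can be sketched in a few lines of Python (a naive recursive check, just to make the property concrete):

```python
class AVLNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def height(node):
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

def is_avl(node):
    # |height(left) - height(right)| <= 1 must hold at every node
    if node is None:
        return True
    balanced = abs(height(node.left) - height(node.right)) <= 1
    return balanced and is_avl(node.left) and is_avl(node.right)

balanced_tree = AVLNode(2, AVLNode(1), AVLNode(3))
degenerate = AVLNode(1, None, AVLNode(2, None, AVLNode(3)))
print(is_avl(balanced_tree))  # True
print(is_avl(degenerate))     # False
```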

The problem and the idea

When we have a data structure and algorithms to handle (insert, remove and lookup) that structure, the native code of our algorithm is usually full of overhead; for example, in an AVL tree (a balanced binary tree), this overhead appears in: checking whether we really have a left or right node while traversing the nodes for lookups, accessing nodes inside nodes, etc. This overhead creates unnecessary assembly operations which, in turn, create native code overhead, even when the compiler optimizes it. This overhead directly impacts the performance of our algorithm (this traditional approach, of course, gives us a very flexible structure and the complexity (not the Big-O) is easy to handle, but we pay for it: a performance loss).
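To make that overhead concrete, here is a sketch of a generic binary-tree lookup in Python; the "is the child there?" branching at every step is exactly the kind of work that a JIT specialized for one concrete tree shape could strip out:

```python
class TreeNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def lookup(node, key):
    # Every iteration pays for the None-check and the left/right
    # branch, regardless of the actual shape of the tree.
    while node is not None:
        if key == node.key:
            return node
        node = node.left if key < node.key else node.right
    return None

root = TreeNode(2, TreeNode(1), TreeNode(3))
print(lookup(root, 3).key)  # 3
print(lookup(root, 4))      # None
```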
