This is the post of 2020, sohappy new yearto you all !

I’m a huge fan of LLVM since 11 years ago when I started playing with it toJIT data structuressuch as AVLs, then later toJIT限制AST树and toJIT native code from TensorFlow graphs。Since then, LLVM evolved into one of the most important compiler framework ecosystem and is used nowadays by a lot of important open-source projects.

One cool project that I recently became aware of isGandiva。Gandiva was developed byDremio和n later捐赠给Apache的箭kudos to Dremio team for that). The main idea of Gandiva is that it provides a compiler to generate LLVM IR that can operate on batches ofApache Arrow。Gandiva被用C ++编写,并配有很多实现构建表达式树,可以是使用JIT'ed LLVM不同的功能。这种设计的一个很好的特性是,它可以使用LLVM来自动优化复杂的表达式,增加了原生的目标平台矢量如AVX同时箭批量操作和执行本机代码,以计算表达式。

The image below gives an overview of Gandiva:

An overview of how Gandiva works. Image from:

In this post I’ll build a very simple expression parser supporting a limited set of operations that I will use to filter a Pandas DataFrame.

Building simple expression with Gandiva

In this section I’ll show how to create a simple expression manually using tree builder from Gandiva.

Using Gandiva Python bindings to JIT and expression

Before building our parser and expression builder for expressions, let’s manually build a simple expression with Gandiva. First, we will create a simple Pandas DataFrame with numbers from 0.0 to 9.0:

进口熊猫作为PD进口pyarrow为PA进口pyarrow.gandiva作为gandiva#创建一个简单的熊猫数据帧DF = pd.DataFrame({ “×”:[1.0 * I为i的范围(10)]})表= pa.Table.from_pandas(DF)架构= pa.Schema.from_pandas(DF)

We converted the DataFrame to anArrow Table, it is important to note that in this case it was a zero-copy operation, Arrow isn’t copying data from Pandas and duplicating the DataFrame. Later we get theschemafrom the table, that contains column types and other metadata.

After that, we want to use Gandiva to build the following expression to filter the data:

(X> 2.0)和(x <6.0)

This expression will be built using nodes from Gandiva:

builder = gandiva.TreeExprBuilder() # Reference the column "x" node_x = builder.make_field(table.schema.field("x")) # Make two literals: 2.0 and 6.0 two = builder.make_literal(2.0, pa.float64()) six = builder.make_literal(6.0, pa.float64()) # Create a function for "x > 2.0" gt_five_node = builder.make_function("greater_than", [node_x, two], pa.bool_()) # Create a function for "x < 6.0" lt_ten_node = builder.make_function("less_than", [node_x, six], pa.bool_()) # Create an "and" node, for "(x > 2.0) and (x < 6.0)" and_node = builder.make_and([gt_five_node, lt_ten_node]) # Make the expression a condition and create a filter condition = builder.make_condition(and_node) filter_ = gandiva.make_filter(table.schema, condition)

This code now looks a little more complex but it is easy to understand. We are basically creating the nodes of a tree that will represent the expression we showed earlier. Here is a graphical representation of what it looks like:

Inspecting the generated LLVM IR

Unfortunately, haven’t found a way to dump the LLVM IR that was generated using the Arrow’s Python bindings, however, we can just use the C++ API to build the same tree and then look at the generated LLVM IR:

auto field_x = field("x", float32()); auto schema = arrow::schema({field_x}); auto node_x = TreeExprBuilder::MakeField(field_x); auto two = TreeExprBuilder::MakeLiteral((float_t)2.0); auto six = TreeExprBuilder::MakeLiteral((float_t)6.0); auto gt_five_node = TreeExprBuilder::MakeFunction("greater_than", {node_x, two}, arrow::boolean()); auto lt_ten_node = TreeExprBuilder::MakeFunction("less_than", {node_x, six}, arrow::boolean()); auto and_node = TreeExprBuilder::MakeAnd({gt_five_node, lt_ten_node}); auto condition = TreeExprBuilder::MakeCondition(and_node); std::shared_ptr filter; auto status = Filter::Make(schema, condition, TestConfiguration(), &filter);

The code above is the same as the Python code, but using the C++ Gandiva API. Now that we built the tree in C++, we can get the LLVM Module and dump the IR code for it. The generated IR is full of boilerplate code and the JIT’ed functions from the Gandiva registry, however the important parts are show below:

;功能ATTRS:alwaysinline norecurse非展开readnone SSP uwtable限定内部zeroext I1 @ less_than_float32_float32(浮点,浮点)local_unnamed_addr#0 {%3 = FCMP OLT浮子%0%1 RET I1%3};功能ATTRS:alwaysinline norecurse非展开readnone SSP uwtable限定内部zeroext I1 @ greater_than_float32_float32(浮点,浮点)local_unnamed_addr#0 {%3 = FCMP OGT浮子%0%1 RET I1%3}(...)%×=负载浮,浮子*%11%greater_than_float32_float32 =调用I1 @ greater_than_float32_float32(浮动%的x,浮子2.000000e + 00)(...)%X11 =负载浮子,浮子*%15%less_than_float32_float32 =调用I1 @ less_than_float32_float32(浮动%X11,浮子6.000000e + 00)

As you can see, on the IR we can see the call to the functionsless_than_float32_float_32andgreater_than_float32_float32这是(在这种情况下很简单的)Gandiva功能做浮动比亚洲金博宝较。通过查看函数名前缀注意函数的专业化。


这IR代码我发现是不是真正执行了一个,但优化的一个。和在优化的一个我们可以看到,内联LLVM的功能,如显示在下面的优化代码的一部分: = load float, float* %10, align 4 %11 = fcmp ogt float, 2.000000e+00 %12 = fcmp olt float, 6.000000e+00 %not.or.cond = and i1 %12, %11

You can see that the expression is now much simpler after optimization as LLVM applied its powerful optimizations and inlined a lot of Gandiva funcions.


现在,我们希望能够实现,因为大熊猫类似的东西DataFrame.query()function using Gandiva. The first problem we will face is that we need to parse a string such as(X> 2.0)和(x <6.0), later we will have to build the Gandiva expression tree using the tree builder from Gandiva and then evaluate that expression on arrow data.

Now, instead of implementing a full parsing of the expression string, I’ll use the Python AST module to parse valid Python code and build an Abstract Syntax Tree (AST) of that expression, that I’ll be later using to emit the Gandiva/LLVM nodes.

The heavy work of parsing the string will be delegated to Python AST module and our work will be mostly walking on this tree and emitting the Gandiva nodes based on that syntax tree. The code for visiting the nodes of this Python AST tree and emitting Gandiva nodes is shown below:

类LLVMGandivaVisitor(ast.NodeVisitor):对于f中self.builder.make_field(F):DEF __init __(个体,df_table):self.table = df_table self.builder = gandiva.TreeExprBuilder()self.columns = {f.nameself.table.schema} self.compare_ops = { “GT”: “GREATER_THAN”, “LT”: “LESS_THAN”,} self.bin_ops = { “BITAND”:self.builder.make_and, “BITOR”:self.builder。make_or, } def visit_Module(self, node): return self.visit(node.body[0]) def visit_BinOp(self, node): left = self.visit(node.left) right = self.visit(node.right) op_name = node.op.__class__.__name__ gandiva_bin_op = self.bin_ops[op_name] return gandiva_bin_op([left, right]) def visit_Compare(self, node): op = node.ops[0] op_name = op.__class__.__name__ gandiva_comp_op = self.compare_ops[op_name] comparators = self.visit(node.comparators[0]) left = self.visit(node.left) return self.builder.make_function(gandiva_comp_op, [left, comparators], pa.bool_()) def visit_Num(self, node): return self.builder.make_literal(node.n, pa.float64()) def visit_Expr(self, node): return self.visit(node.value) def visit_Name(self, node): return self.columns[] def generic_visit(self, node): return node def evaluate_filter(self, llvm_mod): condition = self.builder.make_condition(llvm_mod) filter_ = gandiva.make_filter(self.table.schema, condition) result = filter_.evaluate(self.table.to_batches()[0], pa.default_memory_pool()) arr = result.to_array() pd_result = arr.to_numpy() return pd_result @staticmethod def gandiva_query(df, query): df_table = pa.Table.from_pandas(df) llvm_gandiva_visitor = LLVMGandivaVisitor(df_table) mod_f = ast.parse(query) llvm_mod = llvm_gandiva_visitor.visit(mod_f) results = llvm_gandiva_visitor.evaluate_filter(llvm_mod) return results

正如你所看到的,它的代码,我不支持每一个可能的Python表达式,但它的一个子集轻微非常简单。亚洲金博宝我们做这个班什么是基本的比较和BinOps(二元运算)的Gandiva节点的Python AST的转换节点,。我也正在改变的语义|运营商来表示AND和OR分别如在熊猫query()function.

Register as a Pandas extension

The next step is to create a simple Pandas extension using thegandiva_query()method that we created:

@pd.api.extensions.register_dataframe_accessor("gandiva") class GandivaAcessor: def __init__(self, pandas_obj): self.pandas_obj = pandas_obj def query(self, query): return LLVMGandivaVisitor.gandiva_query(self.pandas_obj, query)

And that is it, now we can use this extension to do things such as:

df = pd.DataFrame({"a": [1.0 * i for i in range(nsize)]}) results = df.gandiva.query("a > 10.0")

As we have registered a Pandas extension calledgandivathat is now a first-class citizen of the Pandas DataFrames.

Let’s create now a 5 million floats DataFrame and use the newquery()method to filter it:

df = pd.DataFrame({"a": [1.0 * i for i in range(50000000)]}) df.gandiva.query("a < 4.0") # This will output: # array([0, 1, 2, 3], dtype=uint32)

Note that the returned values are the indexes satisfying the condition we implemented, so it is different than the Pandasquery()that returns the data already filtered.


That’s it ! I hope you liked the post as I enjoyed exploring Gandiva. It seems that we will probably have more and more tools coming up with Gandiva acceleration, specially for SQL parsing/projection/JITing. Gandiva is much more than what I just showed, but you can get started now to understand more of its architecture and how to build the expression trees.

– Christian S. Perone

Cite this article as: Christian S. Perone, "Gandiva, using LLVM and Arrow to JIT and evaluate Pandas expressions," in亚洲金博宝未知领域, 19/01/2020,//