ArcticDB_demo_querybuilder
ArcticDB Query Builder demo
Note that all of the functionality demonstrated here is also available in the more intuitive LazyDataFrame API, which we recommend using in preference to the QueryBuilder.
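For orientation, here is a minimal sketch of a lazy query (assuming a recent ArcticDB version that includes the lazy API; lib and sym are set up later in this demo):
lazy_df = lib.read(sym, lazy=True)
lazy_df = lazy_df[lazy_df["numeric_column"] < 0.1]  # nothing is executed yet
filtered = lazy_df.collect().data  # the query runs on collect()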
In this demo, we will explore the different functionalities of the QueryBuilder for ArcticDB. We will cover various possibilities of this API, including:
- Filtering
- Projections
- Groupbys and Aggregations
- Combinations of the above features
Why use QueryBuilder?
- Performance boost via efficient C++ implementation that uses multi-threading
- Efficient data access - only reads the data needed
- For very large datasets, some queries are possible that would not otherwise fit into memory
Demo setup¶
Necessary packages installation
!pip install arcticdb
Necessary library imports
import os
import numpy as np
import pandas as pd
import random
import arcticdb as adb
from arcticdb.util.test import random_strings_of_length
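Note that random_strings_of_length is an internal ArcticDB test utility. If it is unavailable in your installation, an equivalent helper is easy to write; the function below is a hypothetical stand-in, not the library's implementation:
import string

def unique_random_strings(count, length):
    # generate `count` distinct random strings of the given length
    result = set()
    while len(result) < count:
        result.add("".join(random.choices(string.ascii_lowercase, k=length)))
    return list(result)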
For this demo we will configure the LMDB file-based backend. ArcticDB achieves its full performance and scale when configured with an object store backend (e.g. S3).
arctic = adb.Arctic("lmdb://arcticdb_demo")
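For comparison, an object store connection uses the same Arctic constructor with an S3 URI. The endpoint, bucket, and auth options below are placeholders, so the line is left commented out:
# arctic = adb.Arctic("s3://s3.eu-west-2.amazonaws.com:my-bucket?aws_auth=true")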
You can have an unlimited number of libraries, but we will just create one to start with.
if 'sample' not in arctic.list_libraries():
    # library does not already exist
    arctic.create_library('sample')
lib = arctic.get_library('sample')
Run the cell to set up preliminary variables. 100,000 unique strings is a pathological case for us: with the default row-slicing policy there are 100,000 rows per data segment, so each unique string will appear around once per data segment in this column.
ten_grouping_values = random_strings_of_length(10, 10, True)
one_hundred_thousand_grouping_values = random_strings_of_length(100_000, 10, True)
rng = np.random.RandomState()
sym_10M = "demo_10M"
sym_100M = "demo_100M"
sym_1B = "demo_1B"
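As an aside, the 100,000 rows per data segment mentioned above is the default row-slicing policy. It can be tuned when a library is created, sketched below with a hypothetical library name:
# opts = adb.LibraryOptions(rows_per_segment=100_000)
# arctic.create_library('sample_custom_slicing', library_options=opts)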
Choose which symbol you want to work with
- sym_10M: symbol with 10 million rows
- sym_100M: symbol with 100 million rows
- sym_1B: symbol with 1 billion rows
Assign the symbol you want to work with to the sym variable
- example: sym = sym_10M
sym = sym_10M
Run this cell to set up the DataFrame according to the symbol name
if sym == sym_10M:
    num_rows = 10_000_000
elif sym == sym_100M:
    num_rows = 100_000_000
elif sym == sym_1B:
    num_rows = 1_000_000_000

df = pd.DataFrame(
    {
        "grouping_column_10": list(random.choices(ten_grouping_values, k=num_rows)),
        "grouping_column_100_000": list(random.choices(one_hundred_thousand_grouping_values, k=num_rows)),
        "numeric_column": rng.rand(num_rows),
    }
)
Demo Start¶
lib.write(sym, df)
Show how the data has been sliced and written to disk.
lib._nvs.read_index(sym)
Show the first 100 rows of data as a sample.
lib.head(sym, n=100).data
Reading¶
Read the symbol without any filtering.
%%time
lib.read(sym)
Most of the time is spent allocating Python strings in the column with 100,000 unique strings, so omitting this column is much faster.
%%time
lib.read(sym, columns=["grouping_column_10", "numeric_column"])
Filtering¶
Note that all of the values in the numeric column are between 0 and 1, so this query does not filter out any data. This demonstrates that performing a full table scan does not significantly impact the performance.
q = adb.QueryBuilder()
q = q[q["numeric_column"] < 2.0]
%%time
lib.read(sym, query_builder=q)
Now we are filtering down to approximately 10% of the rows in the symbol. This is faster than the unfiltered read, as there are now fewer Python strings to allocate.
q = adb.QueryBuilder()
q = q[q["numeric_column"] < 0.1]
%%time
lib.read(sym, query_builder=q)
lib.read(sym, query_builder=q).data
Projections¶
Creating a new column as a function of existing columns and constants is approximately the same speed as a filter that doesn't reduce the amount of data displayed.
q = adb.QueryBuilder()
q = q.apply("new_column", q["numeric_column"] * 2.0)
%%time
lib.read(sym, query_builder=q)
lib.read(sym, query_builder=q).data
Groupbys and Aggregations¶
Grouping is again faster than just reading due to the reduced number of Python string allocations, even with the extra computation performed.
q = adb.QueryBuilder()
q = q.groupby("grouping_column_10").agg({"numeric_column": "mean"})
%%time
lib.read(sym, query_builder=q)
lib.read(sym, query_builder=q).data
Even grouping on a pathologically large number of unique values does not significantly reduce the performance.
q = adb.QueryBuilder()
q = q.groupby("grouping_column_100_000").agg({"numeric_column": "mean"})
%%time
lib.read(sym, query_builder=q)
lib.read(sym, query_builder=q).data
Combinations¶
These operations can be arbitrarily combined in a sequential pipeline.
q = adb.QueryBuilder()
q = (
    q[q["numeric_column"] < 0.1]
    .apply("new_column", q["numeric_column"] * 2.0)
    .groupby("grouping_column_10")
    .agg({"numeric_column": "mean", "new_column": "max"})
)
%%time
lib.read(sym, query_builder=q)
lib.read(sym, query_builder=q).data
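As noted at the start, the same pipeline can also be expressed with the recommended LazyDataFrame API. A rough equivalent (a sketch, assuming a recent ArcticDB version that includes the lazy API):
lazy_df = lib.read(sym, lazy=True)
lazy_df = lazy_df[lazy_df["numeric_column"] < 0.1]
lazy_df["new_column"] = lazy_df["numeric_column"] * 2.0
lazy_df = lazy_df.groupby("grouping_column_10").agg({"numeric_column": "mean", "new_column": "max"})
lazy_df.collect().data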