blazingsql.BlazingContext

class blazingsql.BlazingContext(dask_client='autocheck', network_interface=None, allocator='default', pool=False, initial_pool_size=None, maximum_pool_size=None, enable_logging=False, enable_progress_bar=False, config_options={})

BlazingContext is the Python API of BlazingSQL. Along with initialization arguments that allow for easy multi-GPU distribution, the BlazingContext class has a number of methods that assist not only in creating and querying tables, but also in connecting to remote data sources and understanding your ETL.

Parameters
  • dask_client – Client object from dask.distributed. The dask client used for communicating with other nodes. This is only necessary for running BlazingSQL with multiple nodes. Default: "autocheck"

  • network_interface – string. Network interface used for communicating with the dask-scheduler. Default: None. See note below.

  • allocator – string. Allowed options are "default", "managed", or "existing". With "managed", the allocator uses Unified Virtual Memory (UVM) and may use system memory if GPU memory runs out. With "existing", it assumes you have already set the RMM allocator and therefore does not initialize it (this is for advanced users). Default: "default"

  • pool – boolean. If True, allocate a memory pool at initialization. This can greatly improve performance. Default: False

  • initial_pool_size – long integer. Initial size of memory pool in bytes (if pool=True). If None, it will default to using half of the GPU memory. Default: None

  • maximum_pool_size – long integer. Maximum size of the memory pool in bytes (if pool=True). Default: None

  • enable_logging – boolean. If True, memory allocator logging will be enabled. This can negatively impact performance and is aimed at advanced users. Default: False

  • enable_progress_bar – boolean. Set to True to display a progress bar during query executions. Default: False

  • config_options – dictionary. A dictionary for setting certain parameters in the engine. Default: {}

    List of options (a usage sketch follows this parameter list):

    JOIN_PARTITION_SIZE_THRESHOLD: long integer

    The target size in bytes for the partitions of each side of a join before performing the join. Too small a value can lead to over-partitioning; too large a value can lead to OOM errors. Default: 400000000

    MAX_JOIN_SCATTER_MEM_OVERHEAD: long integer

    The larger this value, the more likely one of the tables in a join will be scattered to all the nodes, instead of using a standard hash-based partitioning shuffle. Value is in bytes. Default: 500000000

    MAX_NUM_ORDER_BY_PARTITIONS_PER_NODE: integer

    The maximum number of partitions that will be made for an order by. Increase this number if running into OOM issues when doing order bys with large amounts of data. Default: 8

    NUM_BYTES_PER_ORDER_BY_PARTITION: long integer

    The maximum size in bytes for each order by partition. Note that MAX_NUM_ORDER_BY_PARTITIONS_PER_NODE will be enforced over this parameter. Default: 400000000

    MAX_DATA_LOAD_CONCAT_CACHE_BYTE_SIZE: long integer

    The maximum size in bytes to which batches read from the scan kernels will be concatenated. Default: 400000000

    MAX_ORDER_BY_SAMPLES_PER_NODE: integer

    The maximum number of order by samples to capture per node. Default: 10000

    BLAZING_PROCESSING_DEVICE_MEM_CONSUMPTION_THRESHOLD: float

    The percent (as a decimal) of total GPU memory that the task executor will be allowed to consume. NOTE: This parameter only works when used in the BlazingContext. Default: 0.9

    BLAZING_DEVICE_MEM_CONSUMPTION_THRESHOLD: float

    The percent (as a decimal) of total GPU memory that the memory resource will consider to be full. NOTE: This parameter only works when used in the BlazingContext. Default: 0.6

    BLAZ_HOST_MEM_CONSUMPTION_THRESHOLD: float

    The percent (as a decimal) of total host memory that the memory resource will consider to be full. When there are several GPUs per server, this resource is shared among all of them in equal parts. NOTE: This parameter only works when used in the BlazingContext. Default: 0.75

    BLAZING_LOGGING_DIRECTORY: string

    A folder path to place all logging files. The path can be relative or absolute. NOTE: This parameter only works when used in the BlazingContext Default: 'blazing_log'

    BLAZING_CACHE_DIRECTORY: string

    A folder path in which to place all ORC files when caching to disk starts. The path can be relative or absolute. NOTE: This parameter only works when used in the BlazingContext. Default: '/tmp/'

    BLAZING_LOCAL_LOGGING_DIRECTORY: string

    A folder path in which to place the client logging file in a dask environment. The path can be relative or absolute. NOTE: This parameter only works when used in the BlazingContext. Default: 'blazing_log'

    MEMORY_MONITOR_PERIOD: integer

    How often the memory monitor checks memory consumption. The value is in milliseconds. Default: 50 (milliseconds)

    MAX_KERNEL_RUN_THREADS: integer

    The number of threads available to run kernels simultaneously. Default: 16

    EXECUTOR_THREADS: integer

    The number of threads available to run executor tasks simultaneously. Default: 10

    MAX_SEND_MESSAGE_THREADS: integer

    The number of threads available to send outgoing messages. Default: 20

    LOGGING_LEVEL: string

    Set the level (as a string) at which messages are registered in the logs for the current logging tool. Log levels in order of priority: {trace, debug, info, warn, err, critical, off}. Using 'trace' will register all information. NOTE: This parameter only works when used in the BlazingContext. Default: 'trace'

    LOGGING_FLUSH_LEVEL: string

    Set the flush level (as a string) for the current logging tool. Log levels in order of priority: {trace, debug, info, warn, err, critical, off}. NOTE: This parameter only works when used in the BlazingContext. Default: 'warn'

    ENABLE_GENERAL_ENGINE_LOGS: boolean

    Enables the 'batch_logger' logger. Default: True

    ENABLE_COMMS_LOGS: boolean

    Enables the 'output_comms' and 'input_comms' loggers. Default: False

    ENABLE_TASK_LOGS: boolean

    Enables the 'task_logger' logger. Default: False

    ENABLE_OTHER_ENGINE_LOGS: boolean

    Enables the 'queries_logger', 'kernels_logger', 'kernels_edges_logger', and 'cache_events_logger' loggers. Default: False

    LOGGING_MAX_SIZE_PER_FILE: string

    Set the max size in bytes for the log files. NOTE: This parameter only works when used in the BlazingContext Default: 1GB

    TRANSPORT_BUFFER_BYTE_SIZE: string

    The size in bytes of each buffer in the pinned transport buffer memory pool. Default: 1MB

    TRANSPORT_POOL_NUM_BUFFERS: integer

    The number of buffers in the pinned buffer memory pool. Default: 1000 (buffers)

    PROTOCOL: string

    The protocol to use with the current BlazingContext. If the user does not explicitly set it, it defaults to whatever protocol the dask client is using ('tcp', 'ucx', ...). NOTE: This parameter only works when used in the BlazingContext. Default: 'tcp'
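A minimal construction sketch using the parameters and config_options documented above; the specific sizes and option values are illustrative, not recommendations:

    from blazingsql import BlazingContext

    # Single-GPU context with an RMM memory pool and a few engine options.
    # The values below are examples only; the defaults are listed above.
    bc = BlazingContext(
        allocator='managed',                # UVM: may spill to system memory if GPU memory runs out
        pool=True,                          # pre-allocate a memory pool for better performance
        initial_pool_size=4_000_000_000,    # 4 GB initial pool, in bytes
        config_options={
            'JOIN_PARTITION_SIZE_THRESHOLD': 300_000_000,
            'MAX_NUM_ORDER_BY_PARTITIONS_PER_NODE': 16,
            'BLAZING_LOGGING_DIRECTORY': 'blazing_log',
        },
    )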

Note

When using BlazingSQL with multiple nodes, you will need to set network_interface to the interface your servers use to communicate with the IP address of the dask-scheduler. You can see the different network interfaces and the IP addresses they serve with the bash command ifconfig. The default is set to 'eth0'.
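A minimal sketch of a distributed setup, assuming a local dask-cuda cluster; the interface name is illustrative and should match the interface reported by ifconfig for your environment:

    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    from blazingsql import BlazingContext

    # One dask-cuda worker per visible GPU on this machine.
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # For a local cluster the loopback interface works; for a real multi-node
    # cluster, connect the Client to your dask-scheduler and pass the interface
    # the workers use to reach it.
    bc = BlazingContext(dask_client=client, network_interface='lo')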

Returns

BlazingContext object

__init__(dask_client='autocheck', network_interface=None, allocator='default', pool=False, initial_pool_size=None, maximum_pool_size=None, enable_logging=False, enable_progress_bar=False, config_options={})

Initialize self. See help(type(self)) for accurate signature.

Methods

create_table(table_name, input, **kwargs)

Create a BlazingSQL table.

describe_table(table_name)

Returns a dictionary with the names of all the columns and their types for the specified table.

drop_table(table_name)

Drop table from BlazingContext memory.

explain(sql[, detail])

Returns a breakdown of a given query's Logical Relational Algebra plan.

fetch(token)

get_free_memory()

Returns a dictionary with gpuID as key and free memory (bytes) as value.

get_max_memory_used()

Returns a dictionary with gpuID as key and max memory used (bytes) as value.

gs(prefix, **kwargs)

Register a Google Storage bucket.

hdfs(prefix, **kwargs)

Register a Hadoop Distributed File System (HDFS) Cluster.

list_tables()

Returns a list with the names of all created tables.

log(query[, logs_table_name])

Query BlazingSQL’s internal log (bsql_logs) that records events from all queries run.

reset_max_memory_used()

Resets the max memory usage counter to 0.

s3(prefix, **kwargs)

Register an AWS S3 bucket.

show_filesystems()

sql(query[, algebra, config_options, …])

Query a BlazingSQL table.

status(token)
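A short sketch of the typical flow using the methods above (create_table, list_tables, explain, sql); the table name and data are illustrative:

    import cudf
    from blazingsql import BlazingContext

    bc = BlazingContext()

    # Create a table from a cuDF DataFrame (file paths, e.g. Parquet or CSV, also work).
    df = cudf.DataFrame({'id': [1, 2, 3], 'value': [10.0, 20.0, 30.0]})
    bc.create_table('sample', df)

    # Inspect and query the table; on a single GPU, .sql() returns a cuDF DataFrame.
    print(bc.list_tables())
    print(bc.explain('SELECT id, value FROM sample WHERE value > 15'))
    result = bc.sql('SELECT id, value FROM sample WHERE value > 15')
    print(result)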