blazingsql.BlazingContext

class blazingsql.BlazingContext(dask_client='autocheck', network_interface=None, allocator='default', pool=False, initial_pool_size=None, maximum_pool_size=None, enable_logging=False, enable_progress_bar=False, config_options={})

BlazingContext is the Python API of BlazingSQL. Along with initialization arguments that allow for easy multi-GPU distribution, the BlazingContext class has a number of methods that assist not only in creating and querying tables, but also in connecting to remote data sources and understanding your ETL.

Parameters
  • dask_client – Client object from dask.distributed. The dask client used for communicating with other nodes. This is only necessary for running BlazingSQL with multiple nodes. Default: "autocheck"

  • network_interface – string. Network interface used for communicating with the dask-scheduler. Default: None. See note below.

  • allocator – string. Allowed options are "default", "managed", or "existing". With "managed", the allocator uses Unified Virtual Memory (UVM) and may use system memory if GPU memory runs out. With "existing", it assumes you have already set the RMM allocator and therefore does not initialize it (this is for advanced users). Default: "default"

  • pool – boolean. If True, allocate a memory pool at initialization. This can greatly improve performance. Default: False

  • initial_pool_size – long integer. Initial size of memory pool in bytes (if pool=True). If None, it will default to using half of the GPU memory. Default: None

  • maximum_pool_size – long integer. Maximum size of the memory pool in bytes (if pool=True). Default: None

  • enable_logging – boolean. If True, memory allocator logging will be enabled. This can negatively impact performance and is aimed at advanced users. Default: False

  • enable_progress_bar – boolean. Set to True to display a progress bar during query executions. Default: False

  • config_options – dictionary. A dictionary for setting certain parameters in the engine. Default: {}

    List of options (a usage sketch follows this parameter list):

    JOIN_PARTITION_SIZE_THRESHOLD: long integer

    The target size in bytes for the partitions of each side of a join before performing the join. Too small a value can lead to over-partitioning; too large a value can lead to OOM errors. Default: 400000000

    MAX_JOIN_SCATTER_MEM_OVERHEAD: long integer

    The larger this value, the more likely one of the tables in a join will be scattered to all the nodes, instead of using a standard hash-based partitioning shuffle. Value is in bytes. Default: 500000000

    MAX_NUM_ORDER_BY_PARTITIONS_PER_NODE: integer

    The maximum number of partitions that will be made for an order by. Increase this number if running into OOM issues when doing order bys with large amounts of data. Default: 8

    NUM_BYTES_PER_ORDER_BY_PARTITION: long integer

    The maximum size in bytes for each order by partition. Note that MAX_NUM_ORDER_BY_PARTITIONS_PER_NODE will be enforced over this parameter. Default: 400000000

    MAX_DATA_LOAD_CONCAT_CACHE_BYTE_SIZE: long integer

    The maximum size in bytes to which batches read from the scan kernels will be concatenated. Default: 400000000

    MAX_ORDER_BY_SAMPLES_PER_NODE: integer

    The maximum number of order by samples to capture per node. Default: 10000

    BLAZING_PROCESSING_DEVICE_MEM_CONSUMPTION_THRESHOLD: float

    The percent (as a decimal) of total GPU memory that the task executor will be allowed to consume. NOTE: This parameter only works when used in the BlazingContext. Default: 0.9

    BLAZING_DEVICE_MEM_CONSUMPTION_THRESHOLD: float

    The percent (as a decimal) of total GPU memory that the memory resource will consider to be full. NOTE: This parameter only works when used in the BlazingContext. Default: 0.6

    BLAZ_HOST_MEM_CONSUMPTION_THRESHOLD: float

    The percent (as a decimal) of total host memory that the memory resource will consider to be full. When there are several GPUs per server, this resource is shared among all of them in equal parts. NOTE: This parameter only works when used in the BlazingContext. Default: 0.75

    BLAZING_LOGGING_DIRECTORY: string

    A folder path to place all logging files. The path can be relative or absolute. NOTE: This parameter only works when used in the BlazingContext Default: 'blazing_log'

    BLAZING_CACHE_DIRECTORY: string

    A folder path in which to place all ORC files when caching to disk starts. The path can be relative or absolute. NOTE: This parameter only works when used in the BlazingContext. Default: '/tmp/'

    BLAZING_LOCAL_LOGGING_DIRECTORY: string

    A folder path in which to place the client logging file in a dask environment. The path can be relative or absolute. NOTE: This parameter only works when used in the BlazingContext. Default: 'blazing_log'

    MEMORY_MONITOR_PERIOD: integer

    How often the memory monitor checks memory consumption. The value is in milliseconds. Default: 50 (milliseconds)

    MAX_KERNEL_RUN_THREADS: integer

    The number of threads available to run kernels simultaneously. Default: 16

    EXECUTOR_THREADS: integer

    The number of threads available to run executor tasks simultaneously. Default: 10

    MAX_SEND_MESSAGE_THREADS: integer

    The number of threads available to send outgoing messages. Default: 20

    LOGGING_LEVEL: string

    Set the level (as a string) at which messages are registered in the logs for the current logging tool. Log levels in order of priority: {trace, debug, info, warn, err, critical, off}. Using 'trace' will register all information. NOTE: This parameter only works when used in the BlazingContext. Default: 'trace'

    LOGGING_FLUSH_LEVEL: string

    Set the flush level (as a string) for the current logging tool. Log levels in order of priority: {trace, debug, info, warn, err, critical, off}. NOTE: This parameter only works when used in the BlazingContext. Default: 'warn'

    ENABLE_GENERAL_ENGINE_LOGS: boolean

    Enables the 'batch_logger' logger. Default: True

    ENABLE_COMMS_LOGS: boolean

    Enables the 'output_comms' and 'input_comms' loggers. Default: False

    ENABLE_TASK_LOGS: boolean

    Enables the 'task_logger' logger. Default: False

    ENABLE_OTHER_ENGINE_LOGS: boolean

    Enables the 'queries_logger', 'kernels_logger', 'kernels_edges_logger', and 'cache_events_logger' loggers. Default: False

    LOGGING_MAX_SIZE_PER_FILE: string

    Set the max size in bytes for the log files. NOTE: This parameter only works when used in the BlazingContext Default: 1GB

    TRANSPORT_BUFFER_BYTE_SIZE: string

    The size in bytes of each buffer in the pinned transport buffer memory pool. Default: 1MB

    TRANSPORT_POOL_NUM_BUFFERS: integer

    The number of buffers in the pinned buffer memory pool. Default: 1000 (buffers)

    PROTOCOL: string

    The protocol to use with the current BlazingContext. If the user does not explicitly set it, it defaults to whatever protocol the dask client is using ('tcp', 'ucx', ...). NOTE: This parameter only works when used in the BlazingContext. Default: 'tcp'
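A minimal construction sketch using the parameters and config_options documented above; the specific sizes and option values are illustrative, not recommendations:

    from blazingsql import BlazingContext

    # Single-GPU context with an RMM memory pool and a few engine options.
    # The values below are examples only; the defaults are listed above.
    bc = BlazingContext(
        allocator='managed',                # UVM: may spill to system memory if GPU memory runs out
        pool=True,                          # pre-allocate a memory pool for better performance
        initial_pool_size=4_000_000_000,    # 4 GB initial pool, in bytes
        config_options={
            'JOIN_PARTITION_SIZE_THRESHOLD': 300_000_000,
            'MAX_NUM_ORDER_BY_PARTITIONS_PER_NODE': 16,
            'BLAZING_LOGGING_DIRECTORY': 'blazing_log',
        },
    )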

Note

When using BlazingSQL with multiple nodes, you will need to set network_interface to the interface your servers use to communicate with the IP address of the dask-scheduler. You can see the different network interfaces and the IP addresses they serve with the bash command ifconfig. The default is set to 'eth0'.
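A minimal sketch of a distributed setup, assuming a local dask-cuda cluster; the interface name is illustrative and should match the interface reported by ifconfig for your environment:

    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    from blazingsql import BlazingContext

    # One dask-cuda worker per visible GPU on this machine.
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # For a local cluster the loopback interface works; for a real multi-node
    # cluster, connect the Client to your dask-scheduler and pass the interface
    # the workers use to reach it.
    bc = BlazingContext(dask_client=client, network_interface='lo')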

Returns

BlazingContext object

__init__(dask_client='autocheck', network_interface=None, allocator='default', pool=False, initial_pool_size=None, maximum_pool_size=None, enable_logging=False, enable_progress_bar=False, config_options={})

Initialize self. See help(type(self)) for accurate signature.

Methods

create_table(table_name, input, **kwargs)

Create a BlazingSQL table.

describe_table(table_name)

Returns a dictionary with the names of all the columns and their types for the specified table.

drop_table(table_name)

Drop table from BlazingContext memory.

explain(sql[, detail])

Returns a breakdown of a given query's Logical Relational Algebra plan.

fetch(token)

get_free_memory()

Returns a dictionary with gpuID as key and free memory (bytes) as value.

get_max_memory_used()

Returns a dictionary with gpuID as key and max memory used (bytes) as value.

gs(prefix, **kwargs)

Register a Google Storage bucket.

hdfs(prefix, **kwargs)

Register a Hadoop Distributed File System (HDFS) Cluster.

list_tables()

Returns a list with the names of all created tables.

log(query[, logs_table_name])

Query BlazingSQL’s internal log (bsql_logs) that records events from all queries run.

reset_max_memory_used()

Resets the max memory usage counter to 0.

s3(prefix, **kwargs)

Register an AWS S3 bucket.

show_filesystems()

sql(query[, algebra, config_options, …])

Query a BlazingSQL table.

status(token)
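A short sketch of the typical flow using the methods above (create_table, list_tables, explain, sql); the table name and data are illustrative:

    import cudf
    from blazingsql import BlazingContext

    bc = BlazingContext()

    # Create a table from a cuDF DataFrame (file paths, e.g. Parquet or CSV, also work).
    df = cudf.DataFrame({'id': [1, 2, 3], 'value': [10.0, 20.0, 30.0]})
    bc.create_table('sample', df)

    # Inspect and query the table; on a single GPU, .sql() returns a cuDF DataFrame.
    print(bc.list_tables())
    print(bc.explain('SELECT id, value FROM sample WHERE value > 15'))
    result = bc.sql('SELECT id, value FROM sample WHERE value > 15')
    print(result)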