blazingsql.BlazingContext

class blazingsql.BlazingContext(dask_client='autocheck', network_interface=None, allocator='default', pool=False, initial_pool_size=None, maximum_pool_size=None, enable_logging=False, enable_progress_bar=False, config_options={})

BlazingContext is the Python API of BlazingSQL. Along with initialization arguments allowing for easy multi-GPU distribution, the BlazingContext class has a number of methods which assist not only in creating and querying tables, but also in connecting to remote data sources and understanding your ETL.
Parameters

- dask_client – Client object from dask.distributed. The dask client used for communicating with other nodes. This is only necessary for running BlazingSQL with multiple nodes. Default: "autocheck"
- network_interface – string. Network interface used for communicating with the dask-scheduler. Default: None. See the note below.
- allocator – string, allowed options are "default", "managed" or "existing". "managed" uses Unified Virtual Memory (UVM) and may use system memory if GPU memory runs out; "existing" assumes you have already set the RMM allocator and therefore does not initialize it (this is for advanced users). Default: "default"
- pool – boolean. If True, allocate a memory pool at initialization. This can greatly improve performance. Default: False
- initial_pool_size – long integer. Initial size of the memory pool in bytes (if pool=True). If None, it defaults to half of the GPU memory. Default: None
- maximum_pool_size – long integer. Maximum size of the memory pool in bytes. Default: None
- enable_logging – boolean. If True, memory allocator logging is enabled. This can negatively impact performance and is aimed at advanced users. Default: False
- enable_progress_bar – boolean. If True, display a progress bar during query execution. Default: False
- config_options – dictionary. A dictionary for setting certain parameters in the engine. A construction sketch follows the list of options below. Default: {}
List of options:

- JOIN_PARTITION_SIZE_THRESHOLD: long integer. Number of bytes to target for the partitions of each side of a join before performing the join. Too small can lead to over-partitioning; too big can lead to OOM errors. Default: 400000000
- MAX_JOIN_SCATTER_MEM_OVERHEAD: long integer. The bigger this value, the more likely one of the tables of a join will be scattered to all the nodes instead of doing a standard hash-based partitioning shuffle. Value is in bytes. Default: 500000000
- MAX_NUM_ORDER_BY_PARTITIONS_PER_NODE: integer. The maximum number of partitions that will be made for an ORDER BY. Increase this number if running into OOM issues when doing ORDER BYs with large amounts of data. Default: 8
- NUM_BYTES_PER_ORDER_BY_PARTITION: long integer. The maximum size in bytes for each ORDER BY partition. Note that MAX_NUM_ORDER_BY_PARTITIONS_PER_NODE will be enforced over this parameter. Default: 400000000
- MAX_DATA_LOAD_CONCAT_CACHE_BYTE_SIZE: long integer. The maximum size in bytes to which batches read from the scan kernels are concatenated. Default: 400000000
- MAX_ORDER_BY_SAMPLES_PER_NODE: integer. The maximum number of ORDER BY samples to capture per node. Default: 10000
- BLAZING_PROCESSING_DEVICE_MEM_CONSUMPTION_THRESHOLD: float. The percent (as a decimal) of total GPU memory that the task executor will be allowed to consume. NOTE: This parameter only works when used in the BlazingContext. Default: 0.9
- BLAZING_DEVICE_MEM_CONSUMPTION_THRESHOLD: float. The percent (as a decimal) of total GPU memory at which the memory resource will consider itself full. NOTE: This parameter only works when used in the BlazingContext. Default: 0.6
- BLAZ_HOST_MEM_CONSUMPTION_THRESHOLD: float. The percent (as a decimal) of total host memory at which the memory resource will consider itself full. In the presence of several GPUs per server, this resource will be shared among all of them in equal parts. NOTE: This parameter only works when used in the BlazingContext. Default: 0.75
- BLAZING_LOGGING_DIRECTORY: string. A folder path to place all logging files. The path can be relative or absolute. NOTE: This parameter only works when used in the BlazingContext. Default: 'blazing_log'
- BLAZING_CACHE_DIRECTORY: string. A folder path to place all ORC files when caching to disk starts. The path can be relative or absolute. NOTE: This parameter only works when used in the BlazingContext. Default: '/tmp/'
- BLAZING_LOCAL_LOGGING_DIRECTORY: string. A folder path to place the client logging file in a dask environment. The path can be relative or absolute. NOTE: This parameter only works when used in the BlazingContext. Default: 'blazing_log'
- MEMORY_MONITOR_PERIOD: integer. How often the memory monitor checks memory consumption, in milliseconds. Default: 50 (milliseconds)
- MAX_KERNEL_RUN_THREADS: integer. The number of threads available to run kernels simultaneously. Default: 16
- EXECUTOR_THREADS: integer. The number of threads available to run executor tasks simultaneously. Default: 10
- MAX_SEND_MESSAGE_THREADS: integer. The number of threads available to send outgoing messages. Default: 20
- LOGGING_LEVEL: string. The level (as a string) to record in the logs for the current logging tool. Log levels in order of priority: {trace, debug, info, warn, err, critical, off}. Using 'trace' records everything. NOTE: This parameter only works when used in the BlazingContext. Default: 'trace'
- LOGGING_FLUSH_LEVEL: string. The flush level (as a string) for the current logging tool. Log levels in order of priority: {trace, debug, info, warn, err, critical, off}. NOTE: This parameter only works when used in the BlazingContext. Default: 'warn'
- ENABLE_GENERAL_ENGINE_LOGS: boolean. Enables the 'batch_logger' logger. Default: True
- ENABLE_COMMS_LOGS: boolean. Enables the 'output_comms' and 'input_comms' loggers. Default: False
- ENABLE_TASK_LOGS: boolean. Enables the 'task_logger' logger. Default: False
- ENABLE_OTHER_ENGINE_LOGS: boolean. Enables the 'queries_logger', 'kernels_logger', 'kernels_edges_logger' and 'cache_events_logger' loggers. Default: False
- LOGGING_MAX_SIZE_PER_FILE: string. The maximum size in bytes for the log files. NOTE: This parameter only works when used in the BlazingContext. Default: 1GB
- TRANSPORT_BUFFER_BYTE_SIZE: string. The size in bytes of the pinned transport buffer memory. Default: 1MB
- TRANSPORT_POOL_NUM_BUFFERS: integer. The number of buffers in the pinned buffer memory pool. Default: 1000 (buffers)
- PROTOCOL: string. The protocol to use with the current BlazingContext. If the user does not set it explicitly, it defaults to whatever protocol the dask client is using ('tcp', 'ucx', ...). NOTE: This parameter only works when used in the BlazingContext. Default: 'tcp'
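As a reference for the pool and config_options arguments, here is a minimal sketch of a single-GPU BlazingContext; the option values simply restate the documented defaults and are placeholders rather than tuning advice.

    # Minimal single-GPU setup; values shown are illustrative placeholders.
    from blazingsql import BlazingContext

    bc = BlazingContext(
        pool=True,                 # pre-allocate a memory pool (can improve performance)
        enable_progress_bar=True,  # show a progress bar while queries run
        config_options={
            "BLAZING_LOGGING_DIRECTORY": "blazing_log",  # where engine log files go
            "MAX_KERNEL_RUN_THREADS": 16,                # threads running kernels simultaneously
            "MEMORY_MONITOR_PERIOD": 50,                 # memory check interval, in milliseconds
        },
    )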
Note

When using BlazingSQL with multiple nodes, you will need to set network_interface to the interface your servers use to communicate with the IP address of the dask-scheduler. You can see the different network interfaces and the IP addresses they serve with the bash command ifconfig. The default is 'eth0'. A distributed construction sketch follows the Returns entry below.

Returns

BlazingContext object
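A sketch of the multi-node case described in the note above; the scheduler address is a placeholder, and 'eth0' should be replaced with whatever interface ifconfig shows reaching your dask-scheduler.

    # Distributed setup, assuming a dask-scheduler and workers are already running.
    from dask.distributed import Client
    from blazingsql import BlazingContext

    client = Client("tcp://scheduler-host:8786")  # placeholder scheduler address
    bc = BlazingContext(
        dask_client=client,
        network_interface="eth0",  # interface that routes to the dask-scheduler
    )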
__init__(dask_client='autocheck', network_interface=None, allocator='default', pool=False, initial_pool_size=None, maximum_pool_size=None, enable_logging=False, enable_progress_bar=False, config_options={})

Initialize self. See help(type(self)) for accurate signature.
Methods

create_table(table_name, input, **kwargs)
    Create a BlazingSQL table.

describe_table(table_name)
    Returns a dictionary with the names of all the columns and their types for the specified table.

drop_table(table_name)
    Drop a table from BlazingContext memory.

explain(sql[, detail])
    Returns a breakdown of a given query's Logical Relational Algebra plan.

fetch(token)

get_free_memory()
    Returns a dictionary which contains as key the gpuID and as value the free memory (bytes).

get_max_memory_used()
    Returns a dictionary which contains as key the gpuID and as value the max memory used (bytes).

gs(prefix, **kwargs)
    Register a Google Storage bucket.

hdfs(prefix, **kwargs)
    Register a Hadoop Distributed File System (HDFS) cluster.

list_tables()
    Returns a list with the names of all created tables.

log(query[, logs_table_name])
    Query BlazingSQL's internal log (bsql_logs) that records events from all queries run.

reset_max_memory_used()
    Resets the max memory usage counter to 0.

s3(prefix, **kwargs)
    Register an AWS S3 bucket.

sql(query[, algebra, config_options, ...])
    Query a BlazingSQL table.

status(token)
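To tie the methods together, an illustrative sketch that creates, inspects, queries, and drops a table; the file path, table name, and query are placeholders, and the exact input types accepted by create_table are covered by its own documentation.

    # End-to-end usage of the methods above; names and paths are placeholders.
    from blazingsql import BlazingContext

    bc = BlazingContext()

    # Create a table from a local file (a file path or a cuDF DataFrame are common inputs).
    bc.create_table("taxi", "/path/to/taxi.parquet")

    print(bc.describe_table("taxi"))                # column name -> type mapping
    print(bc.explain("SELECT count(*) FROM taxi"))  # relational algebra plan

    result = bc.sql("SELECT * FROM taxi LIMIT 10")  # query returns a GPU DataFrame
    print(result)

    bc.drop_table("taxi")

Remote data sources can be registered first with s3, gs, or hdfs before creating tables from them.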