DataStore has two compatibility modes that control whether output is shaped for pandas compatibility or optimized for raw SQL performance.
Overview
| Mode | compat_mode value | Description |
|---|
| Pandas (default) | "pandas" | Full pandas behavior compatibility. Row order preserved, MultiIndex, set_index, dtype corrections, stable sort tiebreakers, -If/isNaN wrappers. |
| Performance | "performance" | SQL-first execution. All pandas compatibility overhead removed. Maximum throughput, but results may differ structurally from pandas. |
What Performance Mode Disables
| Overhead | Pandas mode behavior | Performance mode behavior |
|---|
| Row-order preservation | _row_id injection, rowNumberInAllBlocks(), __orig_row_num__ subqueries | Disabled — row order not guaranteed |
| Stable sort tiebreaker | rowNumberInAllBlocks() ASC appended to ORDER BY | Disabled — ties may have arbitrary order |
| Parquet preserve_order | input_format_parquet_preserve_order=1 | Disabled — parallel Parquet reading allowed |
| GroupBy auto ORDER BY | ORDER BY group_key added (pandas default sort=True) | Disabled — groups returned in arbitrary order |
| GroupBy dropna WHERE | WHERE key IS NOT NULL added (pandas default dropna=True) | Disabled — NULL groups included |
| GroupBy set_index | Group keys set as index | Disabled — group keys stay as columns |
| MultiIndex columns | agg({'col': ['sum','mean']}) returns MultiIndex columns | Disabled — flat column names (col_sum, col_mean) |
-If/isNaN wrappers | sumIf(col, NOT isNaN(col)) for skipna | Disabled — plain sum(col) (ClickHouse natively skips NULL) |
toInt64 on count | toInt64(count()) to match pandas int64 | Disabled — native SQL dtype returned |
fillna(0) for all-NaN sum | Sum of all-NaN returns 0 (pandas behavior) | Disabled — returns NULL |
| Dtype corrections | abs() unsigned→signed, etc. | Disabled — native SQL dtypes |
| Index preservation | Restores original index after SQL execution | Disabled |
first()/last() | argMin/argMax(col, rowNumberInAllBlocks()) | any(col) / anyLast(col) — faster but non-deterministic |
| Single-SQL aggregation | ColumnExpr groupby materializes intermediate DataFrame | Injects LazyGroupByAgg into lazy ops chain — single SQL query |
Enabling Performance Mode
Using config object
from chdb.datastore.config import config
# Enable performance mode
config.use_performance_mode()
# Back to pandas compatibility
config.use_pandas_compat()
# Check current mode
print(config.compat_mode) # 'pandas' or 'performance'
Using module-level functions
from chdb.datastore.config import set_compat_mode, CompatMode, is_performance_mode
# Enable performance mode
set_compat_mode(CompatMode.PERFORMANCE)
# Check
print(is_performance_mode()) # True
# Back to default
set_compat_mode(CompatMode.PANDAS)
Using convenience imports
from chdb import use_performance_mode, use_pandas_compat
use_performance_mode()
# ... high-performance operations ...
use_pandas_compat()
Setting performance mode automatically sets the execution engine to chdb. You do not need to call config.use_chdb() separately.
When to Use Performance Mode
Use performance mode when:
- Processing large datasets (hundreds of thousands to millions of rows)
- Running aggregation-heavy workloads (groupby, sum, mean, count)
- Row order does not matter (e.g., aggregated results, reports, dashboards)
- You want maximum SQL throughput and minimal overhead
- Memory usage is a concern (parallel Parquet reading, no intermediate DataFrames)
Stay in pandas mode when:
- You need exact pandas behavior (row order, MultiIndex, dtypes)
- You rely on
first()/last() returning the true first/last row
- You use
shift(), diff(), cumsum() that depend on row order
- You’re writing tests that compare DataStore output with pandas
Behavior Differences
Row Order
In performance mode, row order is not guaranteed for any operation. This includes:
- Filter results
- GroupBy aggregation results
head() / tail() without explicit sort_values()
first() / last() aggregations
If you need ordered results, add an explicit sort_values():
config.use_performance_mode()
ds = pd.read_csv("data.csv")
# Unordered (fast)
result = ds.groupby("region")["revenue"].sum()
# Ordered (still fast, just adds ORDER BY)
result = ds.groupby("region")["revenue"].sum().sort_values()
GroupBy Results
| Aspect | Pandas mode | Performance mode |
|---|
| Group key location | Index (via set_index) | Regular column |
| Group order | Sorted by key (default) | Arbitrary order |
| NULL groups | Excluded (default dropna=True) | Included |
| Column format | MultiIndex for multi-agg | Flat names (col_func) |
first()/last() | Deterministic (row order) | Non-deterministic (any()/anyLast()) |
Aggregation
config.use_performance_mode()
# Sum of all-NaN group returns NULL (not 0)
# Count returns native uint64 (not forced int64)
# No -If wrappers: sum() instead of sumIf()
result = ds.groupby("cat")["val"].sum()
Single-SQL Execution
In performance mode, ColumnExpr groupby aggregation (e.g., ds[condition].groupby('col')['val'].sum()) is executed as a single SQL query instead of the two-step process used in pandas mode:
config.use_performance_mode()
# Pandas mode: two SQL queries (filter → materialize → groupby)
# Performance mode: one SQL query (WHERE + GROUP BY in same query)
result = ds[ds["rating"] > 3.5].groupby("category")["revenue"].sum()
# Generated SQL (single query):
# SELECT category, sum(revenue) FROM data WHERE rating > 3.5 GROUP BY category
This eliminates the intermediate DataFrame materialization and can significantly reduce memory usage and execution time.
Comparison with Execution Engine
Performance mode (compat_mode) and execution engine (execution_engine) are independent configuration axes:
| Config | Controls | Values |
|---|
execution_engine | Which engine runs the computation | auto, chdb, pandas |
compat_mode | Whether to reshape output for pandas compatibility | pandas, performance |
Setting compat_mode='performance' automatically sets execution_engine='chdb', since performance mode is designed for SQL execution.
from chdb.datastore.config import config
# These are independent
config.use_chdb() # Force chDB engine, keep pandas compat
config.use_performance_mode() # Force chDB + remove pandas overhead
Testing with Performance Mode
When writing tests for performance mode, results may differ from pandas in row order and structural format. Use these strategies:
Sort-then-compare (aggregations, filters)
# Sort both sides by the same columns before comparing
ds_result = ds.groupby("cat")["val"].sum()
pd_result = pd_df.groupby("cat")["val"].sum()
ds_sorted = ds_result.sort_index()
pd_sorted = pd_result.sort_index()
np.testing.assert_array_equal(ds_sorted.values, pd_sorted.values)
Value-range check (first/last)
# first() with any() returns an arbitrary element from the group
result = ds.groupby("cat")["val"].first()
for group_key in groups:
assert result.loc[group_key] in group_values[group_key]
Schema-and-count (LIMIT without ORDER BY)
# head() without sort_values: row set is non-deterministic
result = ds.head(5)
assert len(result) == 5
assert set(result.columns) == expected_columns
Best Practices
- Enable early in your script
from chdb.datastore.config import config
config.use_performance_mode()
# All subsequent operations benefit
ds = pd.read_parquet("data.parquet")
result = ds[ds["amount"] > 100].groupby("region")["amount"].sum()
- Add explicit sorting when order matters
# For display or downstream processing that expects order
result = (ds
.groupby("region")["revenue"].sum()
.sort_values(ascending=False)
)
- Use for batch/ETL workloads
config.use_performance_mode()
# ETL pipeline — order doesn't matter, throughput does
summary = (ds
.filter(ds["date"] >= "2024-01-01")
.groupby(["region", "product"])
.agg({"revenue": "sum", "quantity": "sum", "rating": "mean"})
)
summary.to_df().to_parquet("summary.parquet")
- Switch modes within a session
# Performance mode for heavy computation
config.use_performance_mode()
aggregated = ds.groupby("cat")["val"].sum()
# Back to pandas mode for exact-match comparison
config.use_pandas_compat()
detailed = ds[ds["val"] > 100].head(10)