GCS Parquet Explorer is a single-file Python web application for browsing and querying SAP CDS Parquet data across Google Cloud Storage, Azure Data Lake, and AWS S3. It provides an interactive web UI with DuckDB-powered SQL queries, schema inspection, and Iceberg table support via Polaris REST catalogs.
Two variants (source repo: `sap_cds_dbt`):
- `gcs_explorer.py` — port 8765, HTTP
- `gcs_explorer_server.py` — port 443, HTTPS
Everything lives in one .py file: Python backend + embedded HTML/CSS/JS frontend
in a raw string (HTML_PAGE = r"""..."""). Because the HTML is in a Python raw string,
\u unicode escapes do NOT work in CSS/JS — use literal UTF-8 characters or HTML entities.
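The mechanism behind this, as a minimal sketch: Python never decodes escape sequences inside a raw string, so the six characters `\u2713` pass through to the browser verbatim instead of becoming a single decoded character.

```python
# In a normal Python string, \u2713 is decoded by Python into one character:
normal = "\u2713"
# In a raw string (as used for HTML_PAGE = r"""..."""), the same sequence
# stays six literal characters and reaches the browser unchanged:
raw = r"\u2713"
print(len(normal), len(raw))  # 1 6
```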
| Layer | Technology |
|---|---|
| Server | http.server.ThreadingHTTPServer (multi-threaded) |
| DuckDB Isolation | Separate subprocess via multiprocessing.Process (GIL-isolated) |
| SQL Engine | DuckDB (in-memory, with httpfs + iceberg + azure extensions) |
| Azure Transport | SET azure_transport_option_type='curl' (fixes SSL on SUSE) |
| Parquet | PyArrow (pyarrow.parquet) — no pandas dependency |
| GCS | google-cloud-storage with Application Default Credentials |
| Iceberg | DuckDB Iceberg extension + Polaris REST catalog |
| Polaris REST API | Direct HTTP calls for fast namespace/table listing (~250x faster) |
| Memory | psutil for LRU eviction at 85% system memory |
| Auth | Email-based login with session cookies (server version only) |
| SSL | ZeroSSL certificate with chained PEM (server version only) |
| UI | Embedded HTML/CSS/JS, Fivetran light theme, no frameworks |
| Feature | Local (gcs_explorer.py) | Server (gcs_explorer_server.py) |
|---|---|---|
| Port | 8765 (HTTP) | 443 (HTTPS) |
| URL prefix | / | /datalake_reader/ |
| Login | None | Email-based (LOGIN_PASSWORD) |
| SSL | None | ZeroSSL cert + key |
| Threading | HTTPServer | ThreadingHTTPServer |
| DuckDB process | In-process | Separate subprocess (DuckDBWorker) |
| DuckDB comms | Direct calls | multiprocessing.Queue (command/response) |
| Hang recovery | None | Auto-kill + respawn worker on timeout |
| Socket backlog | Default (5) | 64 (request_queue_size = 64) |
| Azure transport | Default | curl (fixes SSL on SUSE) |
| Polaris listing | DuckDB information_schema | Direct Polaris REST API |
| Credential mgmt | None | UI panel to update client_id/secret |
| Process mgmt | Manual | systemd (auto-restart, boot start) |
Local quick reference:
- Script: `gcs_explorer.py` (in the `sap_cds_dbt` repo under `sap_cds_views/`)
- State file: `~/.gcs_explorer_state.json`
- Dependencies: `duckdb pyarrow google-cloud-storage psutil`
```shell
# Install dependencies
pip install duckdb pyarrow google-cloud-storage psutil

# Authenticate with GCS
gcloud auth application-default login

# Run the explorer
python gcs_explorer.py
```
The explorer persists its state to `~/.gcs_explorer_state.json`, so it resumes where you left off on the next launch.
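The persistence can be sketched as below. The function names and the exact contents of the state dict are illustrative assumptions; the source only specifies the file path and the resume behavior.

```python
import json
import os

STATE_FILE = os.path.expanduser("~/.gcs_explorer_state.json")

def save_state(state, path=STATE_FILE):
    """Persist UI state (e.g. last bucket/prefix) as JSON."""
    with open(path, "w") as f:
        json.dump(state, f)

def load_state(path=STATE_FILE):
    """Return the saved state, or {} on first launch or a corrupt file."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
```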
- Python: `/root/miniconda/bin/python3` (3.13.9)
- Credentials: (see vault)
- Host: `root@sapidesecc8`

Warning: `sudo` on sapidesecc8 uses `/usr/bin/python3`, which does NOT have duckdb/pyarrow installed. Always use `/root/miniconda/bin/python3` in service files and direct execution.
| File | Purpose |
|---|---|
| /usr/sap/gcs_explorer_server.py | Deployed production copy (port 443) |
| /usr/sap/gcs_explorer_server_dev.py | Deployed dev copy (port 8443) |
| /usr/sap/gcs_explorer_cert.pem | SSL certificate (ZeroSSL, chained: server + intermediate CA) |
| /usr/sap/gcs_explorer_key.pem | SSL private key |
| /usr/sap/gcs_explorer.env | Environment variables (secrets) |
| /etc/systemd/system/gcs-explorer.service | systemd service unit (production) |
| /etc/systemd/system/gcs-explorer-dev.service | systemd service unit (dev) |
| /var/log/gcs_explorer.log | Server log file (production) |
| /var/log/gcs_explorer_dev.log | Server log file (dev) |
```ini
[Unit]
Description=GCS Parquet Explorer Web Server
After=network.target

[Service]
Type=simple
WorkingDirectory=/usr/sap
EnvironmentFile=-/usr/sap/gcs_explorer.env
ExecStart=/root/miniconda/bin/python3 -u /usr/sap/gcs_explorer_server.py
Restart=always
RestartSec=5
StandardOutput=append:/var/log/gcs_explorer.log
StandardError=append:/var/log/gcs_explorer.log

[Install]
WantedBy=multi-user.target
```
Features: auto-restart on crash (5s delay), auto-start on reboot, unbuffered output (-u) for immediate log visibility.
The server also hosts the SAP Skills Portal static files at the /sap_skills/ route,
serving HTML/CSS/JS from /usr/sap/sap_skills/ on the filesystem.
| Variable | Purpose |
|---|---|
| GCS_EXPLORER_PASSWORD | Login password (default: changeme) |
| POLARIS_GCS_CLIENT_ID | Google Cloud Polaris OAuth client ID |
| POLARIS_GCS_CLIENT_SECRET | Google Cloud Polaris OAuth client secret |
| POLARIS_AZURE_CLIENT_ID | Azure Polaris OAuth client ID |
| POLARIS_AZURE_CLIENT_SECRET | Azure Polaris OAuth client secret |
| POLARIS_AWS_CLIENT_ID | AWS Polaris OAuth client ID |
| POLARIS_AWS_CLIENT_SECRET | AWS Polaris OAuth client secret |
Secrets are stored in /usr/sap/gcs_explorer.env and loaded by systemd via
EnvironmentFile. Note: the current production server still has credentials hardcoded in its
deployed copy and works without the env file.
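An illustrative env file (placeholder values only; the real secrets are never committed):

```shell
# /usr/sap/gcs_explorer.env — loaded by systemd via EnvironmentFile
GCS_EXPLORER_PASSWORD=changeme
POLARIS_GCS_CLIENT_ID=<client-id>
POLARIS_GCS_CLIENT_SECRET=<client-secret>
POLARIS_AZURE_CLIENT_ID=<client-id>
POLARIS_AZURE_CLIENT_SECRET=<client-secret>
POLARIS_AWS_CLIENT_ID=<client-id>
POLARIS_AWS_CLIENT_SECRET=<client-secret>
```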
DuckDB's C extension holds the Python GIL (Global Interpreter Lock) during network I/O
operations such as ATTACH TYPE ICEBERG and iceberg_scan over S3. In the old
in-process design, this froze ALL Python threads — including the HTTP server's accept loop. No amount
of Python-level locking or watchdogging could prevent this because the GIL itself was blocked in C code.
Solution: Moving DuckDB to a separate subprocess gives it its own GIL that cannot affect
the HTTP server. The two processes communicate via multiprocessing.Queue using a command/response pattern.
| Command | Description | Timeout |
|---|---|---|
| exec_sql | Execute a single SQL query | 50s |
| exec_multi | Execute multiple SQL statements (ATTACH, etc.) | 65s |
| register_table | Register a PyArrow table in DuckDB via IPC serialization | 30s |
| unregister_table | Remove a table from DuckDB | 15s |
| init | Initialize DuckDB with extensions | — |
| interrupt | Cancel current operation | — |
| shutdown | Gracefully terminate the worker | — |
```python
# Main -> Worker (via cmd_q)
{"id": "uuid", "cmd": "exec_sql", "args": {"query": "SELECT ..."}}

# Worker -> Main (via resp_q)
{"id": "uuid", "status": "ok", "result": {"cols": [...], "data": [[...]]}}
```
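The command/response pattern can be sketched as below. To keep the example self-contained it uses a thread in place of the real `multiprocessing.Process` (a thread gives no GIL isolation; this only demonstrates the queue protocol and the timeout). The helper names mirror the document but the bodies are illustrative.

```python
import queue
import threading
import uuid

def worker_loop(cmd_q, resp_q):
    # Stand-in for the DuckDB worker: executes commands from cmd_q
    # and posts responses tagged with the same id to resp_q.
    while True:
        msg = cmd_q.get()
        if msg["cmd"] == "shutdown":
            break
        # A real worker would run the SQL in DuckDB here.
        resp_q.put({"id": msg["id"], "status": "ok",
                    "result": {"echo": msg["args"]}})

def send_command(cmd_q, resp_q, cmd, args, timeout=5):
    """Command/response with a timeout. On timeout the real server
    SIGKILLs and respawns the worker; this sketch just raises."""
    msg_id = str(uuid.uuid4())
    cmd_q.put({"id": msg_id, "cmd": cmd, "args": args})
    try:
        return resp_q.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError(f"worker hung on {cmd!r}")

cmd_q, resp_q = queue.Queue(), queue.Queue()
threading.Thread(target=worker_loop, args=(cmd_q, resp_q), daemon=True).start()
resp = send_command(cmd_q, resp_q, "exec_sql", {"query": "SELECT 1"})
print(resp["status"])  # ok
cmd_q.put({"id": "x", "cmd": "shutdown", "args": {}})
```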
When a DuckDB operation hangs (e.g., iceberg_scan over unreachable S3):
1. The `send_command()` timeout fires
2. `DuckDBWorker._kill_and_respawn()` sends SIGKILL to the subprocess
3. `loaded_tables` and `polaris_catalogs` are cleared
The _maybe_evict() function is called before loading any new table. It checks system memory
usage via psutil and evicts the oldest tables from the loaded_tables
OrderedDict (LRU order) until system memory drops below 85%.
```python
# Memory threshold
MEMORY_THRESHOLD = 0.85  # evict at 85% system memory
```
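The eviction loop can be sketched as follows. To keep the sketch runnable without psutil, the memory reading is injected as a callable (`memory_fraction`, a hypothetical parameter standing in for `psutil.virtual_memory().percent / 100`); the real `_maybe_evict()` reads it directly.

```python
from collections import OrderedDict

MEMORY_THRESHOLD = 0.85  # evict at 85% system memory

loaded_tables = OrderedDict()  # table name -> table, oldest-loaded first

def _maybe_evict(memory_fraction):
    """Evict the oldest tables until memory_fraction() < MEMORY_THRESHOLD."""
    evicted = []
    while loaded_tables and memory_fraction() >= MEMORY_THRESHOLD:
        name, _ = loaded_tables.popitem(last=False)  # drop the oldest (LRU)
        evicted.append(name)
    return evicted

# Simulated readings: memory pressure drops as tables are evicted.
loaded_tables.update({"mara": object(), "vbak": object(), "vbap": object()})
readings = iter([0.90, 0.88, 0.84])
print(_maybe_evict(lambda: next(readings)))  # ['mara', 'vbak']
```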
```python
import os
import duckdb

def init_duckdb():
    # Ensure HOME is set - systemd services may not have it
    if not os.environ.get("HOME"):
        os.environ["HOME"] = "/root"
    db_conn = duckdb.connect(":memory:")
    db_conn.execute(f"SET home_directory='{os.environ['HOME']}';")
    db_conn.execute("INSTALL httpfs; LOAD httpfs;")
    db_conn.execute("INSTALL iceberg; LOAD iceberg;")
    db_conn.execute("INSTALL azure; LOAD azure;")
    db_conn.execute("SET azure_transport_option_type='curl';")  # Fixes SSL on SUSE
    return db_conn
```
Without `SET home_directory`, DuckDB cannot install extensions when running under systemd
(which strips HOME). This causes "Can't find the home directory at ''" errors when
connecting to Polaris catalogs.
The explorer uses the DuckDB Iceberg extension for querying Iceberg tables and the Polaris REST catalog for fast namespace and table listing. Three catalog presets are pre-configured for GCS, Azure, and AWS.
| Preset | Cloud | Catalog Name | Auth |
|---|---|---|---|
| gcs | Google Cloud Storage | obeisance_plaintive | OAuth2 client credentials + gcloud auth |
| azure | Azure Data Lake | log_pseudo | OAuth2 client credentials + CREDENTIAL_CHAIN |
| aws | AWS S3 | surfacing_caramel | OAuth2 + Polaris vended S3 credentials |
Namespace and table listing uses the Polaris REST API directly instead of DuckDB
information_schema queries. This is approximately 250x faster
(0.3s vs 70s for Azure).
```python
# Key REST API functions
def _polaris_oauth_token(endpoint, client_id, client_secret):
    """POST to {endpoint}/v1/oauth/tokens with client_credentials grant."""

def _polaris_rest_get(endpoint, path, token):
    """GET {endpoint}{path} with Bearer token."""

def list_polaris_namespaces(alias):
    """GET /v1/{catalog}/namespaces"""

def list_polaris_tables(alias, namespace):
    """GET /v1/{catalog}/namespaces/{ns}/tables"""
```
DuckDB's Iceberg extension does NOT send the X-Iceberg-Access-Delegation: vended-credentials
header when querying via ATTACH ... TYPE ICEBERG. The workaround implemented in
_rewrite_aws_query():
1. Detect table references of the form `aws.namespace.table`
2. Resolve the `metadata-location` via the Polaris REST API with the vended-credentials header
3. Extract the vended S3 credentials from the `config` field
4. Rewrite the query to use `iceberg_scan('s3://...metadata.json')`

```sql
-- User writes:
SELECT * FROM aws.sap_hana_db_saphanadb.mara LIMIT 100

-- Server rewrites to:
SELECT * FROM iceberg_scan('s3://ts-s3-multiformat-dest-demo/sap_hana_db_saphanadb/mara/metadata/00092-xxx.metadata.json') LIMIT 100
```
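The rewrite step can be sketched with a regex substitution. Here a dict stands in for the metadata locations that the real code resolves live via the Polaris REST API; the bucket path and function body are illustrative.

```python
import re

# Hypothetical cache of metadata locations; the real code resolves these
# via the Polaris REST API with the vended-credentials header.
METADATA_LOCATIONS = {
    "aws.sap_hana_db_saphanadb.mara":
        "s3://bucket/sap_hana_db_saphanadb/mara/metadata/00092-xxx.metadata.json",
}

def rewrite_aws_query(sql):
    """Replace aws.<ns>.<table> references with iceberg_scan('<metadata.json>')."""
    def repl(m):
        loc = METADATA_LOCATIONS.get(m.group(0))
        return f"iceberg_scan('{loc}')" if loc else m.group(0)
    return re.sub(r"\baws\.\w+\.\w+\b", repl, sql)
```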
The server version includes a Manage Credentials UI panel that allows updating
OAuth2 client_id and client_secret at runtime via POST /api/polaris/update_credentials.
Changes persist in memory until server restart. If a provider is currently connected, it is
automatically disconnected so the next connect uses the new credentials.
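The update-and-disconnect flow can be sketched as below; the in-memory dicts and the returned payload are illustrative assumptions (the real handler also issues a DETACH in DuckDB).

```python
# Hypothetical in-memory preset store and connection registry.
catalog_presets = {"aws": {"client_id": "old", "client_secret": "old"}}
polaris_catalogs = {"aws": {"attached": True}}

def update_polaris_credentials(alias, client_id, client_secret):
    """Update a preset in memory; disconnect the provider if currently
    connected so the next connect uses the new credentials."""
    preset = catalog_presets[alias]
    preset["client_id"] = client_id
    preset["client_secret"] = client_secret
    reconnect = alias in polaris_catalogs
    if reconnect:
        polaris_catalogs.pop(alias)  # real code also DETACHes in DuckDB
    return {"updated": alias, "reconnect_required": reconnect}

print(update_polaris_credentials("aws", "new-id", "new-secret"))
```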
All routes are under the BASE_PATH prefix (/datalake_reader on server).
| Endpoint | Function | Description |
|---|---|---|
| / | — | Serves HTML_PAGE |
| /login | — | Serves LOGIN_PAGE (server only) |
| /logout | — | Clears session cookie (server only) |
| /api/init | init_gcs() | Initialize GCS client, return status + last state |
| /api/buckets | list_buckets() | List all accessible GCS buckets |
| /api/ls?prefix=&bucket= | list_path() | List dirs/files under a GCS prefix with stats |
| /api/parquet?path= | read_parquet() | Read single parquet file, register in DuckDB |
| /api/load_dir?prefix= | read_all_parquets_in_dir() | Read all parquets in a data/ dir, combine |
| /api/sql?query=&prefix= | run_sql() | Execute SQL via DuckDB, auto-load missing tables |
| /api/cat?path= | read_file_text() | Read a text file from GCS |
| /api/tables | get_loaded_tables() | List tables registered in DuckDB |
| /api/health | — | Health check (no auth, no DuckDB) |
| /api/auth | run_gcloud_auth() | Run gcloud auth application-default login |
| /api/azure_auth | run_azure_auth() | Create DuckDB Azure credential chain secret |
| /api/aws_auth?mode= | run_aws_auth() | Create DuckDB S3 secret (role or keys) |
| /api/polaris/connect | connect_polaris() | Attach Iceberg catalog to DuckDB |
| /api/polaris/disconnect | disconnect_polaris() | Detach catalog |
| /api/polaris/catalogs | get_connected_catalogs() | List connected catalogs (no secrets) |
| /api/polaris/presets | get_catalog_presets() | Return presets with masked secrets |
| /api/polaris/namespaces?alias= | list_polaris_namespaces() | List namespaces via REST API |
| /api/polaris/tables?alias=&namespace= | list_polaris_tables() | List tables via REST API |
| /api/polaris/all_tables | list_all_polaris_tables() | All tables across all connected catalogs |
| Endpoint | Function | Description |
|---|---|---|
| /api/login | — | Email login, sets session cookie (server only) |
| /api/restart | — | Restart server via systemctl restart gcs-explorer |
| /api/polaris/update_credentials | update_polaris_credentials() | Update preset credentials at runtime |
```shell
# 1. Upload to server
scp /tmp/gcs_explorer_server.py sapidesecc8:/tmp/gcs_explorer_server.py

# 2. Deploy and restart
ssh sapidesecc8 "sudo cp /tmp/gcs_explorer_server.py /usr/sap/gcs_explorer_server.py; sudo systemctl restart gcs-explorer"

# 3. Verify
ssh sapidesecc8 "sudo systemctl status gcs-explorer --no-pager"
```
A separate dev instance runs alongside production on port 8443:
| Property | Production | Dev |
|---|---|---|
| Port | 443 | 8443 |
| URL prefix | /datalake_reader/ | /datalake_reader_dev/ |
| Service | gcs-explorer.service | gcs-explorer-dev.service |
| Server file | /usr/sap/gcs_explorer_server.py | /usr/sap/gcs_explorer_server_dev.py |
| Log file | /var/log/gcs_explorer.log | /var/log/gcs_explorer_dev.log |
```shell
# Dev deployment
scp /tmp/gcs_explorer_server_dev.py sapidesecc8:/tmp/
ssh sapidesecc8 "sudo cp /tmp/gcs_explorer_server_dev.py /usr/sap/gcs_explorer_server_dev.py; sudo systemctl restart gcs-explorer-dev"
```
```shell
sudo systemctl start gcs-explorer     # Start
sudo systemctl stop gcs-explorer      # Stop
sudo systemctl restart gcs-explorer   # Restart (after deploy)
sudo systemctl status gcs-explorer    # Check status
sudo systemctl enable gcs-explorer    # Enable on boot (already done)
journalctl -u gcs-explorer -f         # Follow logs
```
SSL files:
- /usr/sap/gcs_explorer_cert.pem (ZeroSSL, chained: server + intermediate CA)
- /usr/sap/gcs_explorer_key.pem

Proactive measures (prevent freezes):
| Measure | What It Prevents |
|---|---|
| Subprocess isolation (separate GIL) | DuckDB C code can never freeze the HTTP server |
| Per-command timeouts (65s ATTACH, 50s query) | Hung operations auto-kill the worker |
| Skip-if-already-connected in connect_polaris() | Redundant ATTACH calls that caused contention |
| Auto-respawn on worker death | Fresh DuckDB after any crash, no manual restart |
| /api/health endpoint (no auth, no DuckDB) | Always responds even when worker is hung |
Reactive backup measures:
| Measure | What It Catches |
|---|---|
| External systemd watchdog timer (every 60s) | Curls /api/health, restarts if unresponsive |
| systemd Restart=always | Process-level crashes |
| BrokenPipeError handling | Client disconnects during response |
| Socket backlog = 64 | Handles queued connections under load |
| Issue | Cause | Solution |
|---|---|---|
| Port already in use | Previous process did not release port 443 | sudo lsof -i :443 to find PID, then kill -9 <PID>. SO_REUSEADDR is set but may not help if old process is alive. |
| DuckDB freezes / hangs | C extension holding GIL during network I/O | Subprocess architecture auto-recovers. Worker is killed and respawned on timeout. If the entire server freezes, the systemd watchdog restarts it within 60s. |
| SSL certificate errors | Expired or missing ZeroSSL cert | Check expiry: openssl x509 -enddate -noout -in /usr/sap/gcs_explorer_cert.pem. Renew via ZeroSSL dashboard, replace PEM files, restart service. |
| GCS credentials error | Missing or expired Application Default Credentials | Re-run gcloud auth application-default login (local) or check the service account key on the server. |
| Memory pressure / OOM | Too many large Parquet tables loaded | LRU eviction triggers automatically at 85% memory. Manually unload tables via the UI or restart the service. |
| SUSE Python path issues | sudo uses /usr/bin/python3 without duckdb/pyarrow | Always use /root/miniconda/bin/python3 in service files and direct execution. Never rely on system Python. |
| Azure SSL errors on SUSE | Default SSL transport fails on SUSE Linux | Set azure_transport_option_type='curl' in DuckDB init. The server auto-detects CA bundle paths. |
| DuckDB extension install fails under systemd | Missing HOME environment variable | Set os.environ["HOME"] = "/root" and SET home_directory in DuckDB before installing extensions. |
| AWS queries hang forever | DuckDB Iceberg ext doesn't send vended-credentials header | The _rewrite_aws_query() workaround automatically resolves metadata via REST API and rewrites queries to use iceberg_scan(). |
| Worker respawn loses state | All DuckDB secrets, catalogs, and tables are lost on respawn | Expected behavior. Users need to reconnect catalogs and reload tables after a worker restart. |