GCS Parquet Explorer

Datalake Reader & Iceberg Table Explorer
SAP S/4HANA 2023

Overview

GCS Parquet Explorer is a single-file Python web application for browsing and querying SAP CDS Parquet data across Google Cloud Storage, Azure Data Lake, and AWS S3. It provides an interactive web UI with DuckDB-powered SQL queries, schema inspection, and Iceberg table support via Polaris REST catalogs.

No pandas dependency: the entire stack runs on DuckDB + PyArrow for maximum performance and a minimal memory footprint. In the server version, DuckDB runs in a separate, GIL-isolated subprocess so its C code cannot block the HTTP server.

Application

Type
Single-file Python web app
GitHub
fivetran-antoniocarbone/gcs-parquet-explorer
Live URL
sapidesecc8/datalake_reader/
SQL Engine
DuckDB (subprocess-isolated)
Parquet
PyArrow (no pandas)

Two Versions

Local
gcs_explorer.py — port 8765, HTTP
Server
gcs_explorer_server.py — port 443, HTTPS
Cloud Support
GCS, Azure Data Lake, AWS S3
Iceberg
DuckDB extension + Polaris REST
Default Bucket
sap_cds_dbt

Key Features

  • Browse GCS buckets and Parquet files
  • SQL queries via DuckDB
  • Schema inspection and Parquet metadata
  • Multi-cloud support (GCS, Azure, S3)
  • Iceberg table support via Polaris catalogs
  • Memory-aware caching with LRU eviction
  • Responsive Fivetran-themed UI
  • AWS vended credentials with query rewrite

Architecture

Single-File Design

Everything lives in one .py file: Python backend + embedded HTML/CSS/JS frontend in a raw string (HTML_PAGE = r"""..."""). Because the HTML is in a Python raw string, \u unicode escapes do NOT work in CSS/JS — use literal UTF-8 characters or HTML entities.
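The raw-string behavior can be illustrated from the Python side (a minimal sketch; the HTML_PAGE content below is a stand-in, not the real page):

```python
# In a raw string, Python does not process backslash escapes, so a "\u2713"
# written inside HTML_PAGE stays as six literal characters instead of one
# interpreted character.
HTML_PAGE = r"""<script>var check = "\u2713";</script>"""

raw_len = len(r"\u2713")    # 6 literal chars: backslash, u, 2, 7, 1, 3
cooked_len = len("\u2713")  # 1 char: the interpreted escape
```

This is why the embedded frontend uses literal UTF-8 characters or HTML entities instead of escape sequences.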

Technology Stack

Server: http.server.ThreadingHTTPServer (multi-threaded)
DuckDB isolation: separate subprocess via multiprocessing.Process (GIL-isolated)
SQL engine: DuckDB (in-memory, with httpfs + iceberg + azure extensions)
Azure transport: SET azure_transport_option_type='curl' (fixes SSL on SUSE)
Parquet: PyArrow (pyarrow.parquet), no pandas dependency
GCS: google-cloud-storage with Application Default Credentials
Iceberg: DuckDB Iceberg extension + Polaris REST catalog
Polaris REST API: direct HTTP calls for fast namespace/table listing (~250x faster)
Memory: psutil for LRU eviction at 85% system memory
Auth: email-based login with session cookies (server version only)
SSL: ZeroSSL certificate with chained PEM (server version only)
UI: embedded HTML/CSS/JS, Fivetran light theme, no frameworks

Server vs Local Differences

  • Port: local 8765 (HTTP); server 443 (HTTPS)
  • URL prefix: local /; server /datalake_reader/
  • Login: local none; server email-based (LOGIN_PASSWORD)
  • SSL: local none; server ZeroSSL cert + key
  • Threading: local HTTPServer; server ThreadingHTTPServer
  • DuckDB process: local in-process; server separate subprocess (DuckDBWorker)
  • DuckDB comms: local direct calls; server multiprocessing.Queue (command/response)
  • Hang recovery: local none; server auto-kill + respawn worker on timeout
  • Socket backlog: local default (5); server 64 (request_queue_size = 64)
  • Azure transport: local default; server curl (fixes SSL on SUSE)
  • Polaris listing: local DuckDB information_schema; server direct Polaris REST API
  • Credential mgmt: local none; server UI panel to update client_id/secret
  • Process mgmt: local manual; server systemd (auto-restart, boot start)

Local Version

Configuration

File
gcs_explorer.py
Port
8765 (HTTP)
Auth
None
Default Bucket
sap_cds_dbt
Default Prefix
sap_cds_views/
State File
~/.gcs_explorer_state.json
DuckDB
In-process (direct calls)
Python
3.10+ required

Requirements

duckdb
pyarrow
google-cloud-storage
psutil

Setup

# Install dependencies
pip install duckdb pyarrow google-cloud-storage psutil

# Authenticate with GCS
gcloud auth application-default login

# Run the explorer
python gcs_explorer.py

The local version stores the last-browsed bucket and prefix in ~/.gcs_explorer_state.json so it resumes where you left off on the next launch.
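The persistence can be sketched as follows (save_state and load_state are hypothetical names; only the file path and the resume behavior come from the text):

```python
import json
import os

STATE_FILE = os.path.expanduser("~/.gcs_explorer_state.json")

def save_state(bucket: str, prefix: str, path: str = STATE_FILE) -> None:
    # Persist the last-browsed location so the next launch can resume there.
    with open(path, "w") as f:
        json.dump({"bucket": bucket, "prefix": prefix}, f)

def load_state(path: str = STATE_FILE) -> dict:
    # A missing or corrupt state file falls back to the defaults.
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        return {"bucket": "sap_cds_dbt", "prefix": "sap_cds_views/"}
```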

Server Version (sapidesecc8)

Host Details

Hostname
sapidesecc8.fivetran-internal-sales.com
OS
SUSE Linux Enterprise Server 15 SP5
Python
/root/miniconda/bin/python3 (3.13.9)
Port
443 (HTTPS)
URL
sapidesecc8/datalake_reader/
SSL
ZeroSSL (expires Jun 25, 2026)
Login
(see vault)
SSH
root@sapidesecc8

Important: sudo on sapidesecc8 uses /usr/bin/python3, which does NOT have duckdb/pyarrow. Always use /root/miniconda/bin/python3 in service files and direct execution.

Server Files

/usr/sap/gcs_explorer_server.py: deployed production copy (port 443)
/usr/sap/gcs_explorer_server_dev.py: deployed dev copy (port 8443)
/usr/sap/gcs_explorer_cert.pem: SSL certificate (ZeroSSL, chained: server + intermediate CA)
/usr/sap/gcs_explorer_key.pem: SSL private key
/usr/sap/gcs_explorer.env: environment variables (secrets)
/etc/systemd/system/gcs-explorer.service: systemd service unit (production)
/etc/systemd/system/gcs-explorer-dev.service: systemd service unit (dev)
/var/log/gcs_explorer.log: server log file (production)
/var/log/gcs_explorer_dev.log: server log file (dev)

systemd Service

[Unit]
Description=GCS Parquet Explorer Web Server
After=network.target

[Service]
Type=simple
WorkingDirectory=/usr/sap
EnvironmentFile=-/usr/sap/gcs_explorer.env
ExecStart=/root/miniconda/bin/python3 -u /usr/sap/gcs_explorer_server.py
Restart=always
RestartSec=5
StandardOutput=append:/var/log/gcs_explorer.log
StandardError=append:/var/log/gcs_explorer.log

[Install]
WantedBy=multi-user.target

Features: auto-restart on crash (5s delay), auto-start on reboot, unbuffered output (-u) for immediate log visibility.

Additional Routes

The server also hosts the SAP Skills Portal static files at the /sap_skills/ route, serving HTML/CSS/JS from /usr/sap/sap_skills/ on the filesystem.

Environment Variables

GCS_EXPLORER_PASSWORD: login password (default: changeme)
POLARIS_GCS_CLIENT_ID: Google Cloud Polaris OAuth client ID
POLARIS_GCS_CLIENT_SECRET: Google Cloud Polaris OAuth client secret
POLARIS_AZURE_CLIENT_ID: Azure Polaris OAuth client ID
POLARIS_AZURE_CLIENT_SECRET: Azure Polaris OAuth client secret
POLARIS_AWS_CLIENT_ID: AWS Polaris OAuth client ID
POLARIS_AWS_CLIENT_SECRET: AWS Polaris OAuth client secret

On the server, these are stored in /usr/sap/gcs_explorer.env and loaded by systemd via EnvironmentFile. The current production server still has credentials hardcoded in its deployed copy and works without the env file.

DuckDB Subprocess Architecture

Why Subprocess Isolation?

DuckDB's C extension holds the Python GIL (Global Interpreter Lock) during network I/O operations such as ATTACH TYPE ICEBERG and iceberg_scan over S3. In the old in-process design, this froze ALL Python threads — including the HTTP server's accept loop. No amount of Python-level locking or watchdogging could prevent this because the GIL itself was blocked in C code.

Solution: Moving DuckDB to a separate subprocess gives it its own GIL that cannot affect the HTTP server. The two processes communicate via multiprocessing.Queue using a command/response pattern.

Process Architecture

┌─ Main Process (HTTP Server) ─────┐      ┌─ Child Process (DuckDB) ───────┐
│ ThreadingHTTPServer              │      │ _duckdb_worker()               │
│ All HTTP handlers                │      │ Owns db_conn (duckdb.connect)  │
│ GCS operations (list, download)  │  Q   │ Owns arrow_tables dict         │
│ Polaris REST API calls           │ ───> │ Command loop:                  │
│ DuckDBWorker (sends commands)    │ <─── │   exec_sql, exec_multi,        │
│ loaded_tables (metadata only)    │  Q   │   register_table,              │
│ polaris_catalogs dict            │      │   unregister_table             │
└──────────────────────────────────┘      └────────────────────────────────┘

Commands

  • exec_sql: execute a single SQL query (timeout 50s)
  • exec_multi: execute multiple SQL statements, e.g. ATTACH (timeout 65s)
  • register_table: register a PyArrow table in DuckDB via IPC serialization (timeout 30s)
  • unregister_table: remove a table from DuckDB (timeout 15s)
  • init: initialize DuckDB with extensions
  • interrupt: cancel the current operation
  • shutdown: gracefully terminate the worker

Command Protocol

# Main -> Worker (via cmd_q)
{"id": "uuid", "cmd": "exec_sql", "args": {"query": "SELECT ..."}}

# Worker -> Main (via resp_q)
{"id": "uuid", "status": "ok", "result": {"cols": [...], "data": [[...]]}}

Hang Recovery

When a DuckDB operation hangs (e.g., iceberg_scan over unreachable S3):

  1. The send_command() timeout fires
  2. DuckDBWorker._kill_and_respawn() sends SIGKILL to the subprocess
  3. All pending waiters are woken with "worker restarted" error
  4. loaded_tables and polaris_catalogs are cleared
  5. A fresh subprocess starts with clean DuckDB
  6. The HTTP server never stops responding
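The recovery sequence above can be sketched like this (a simplified sketch; only _kill_and_respawn and the cleared dicts come from the text, everything else is assumed):

```python
import multiprocessing as mp

def _idle_worker(cmd_q, resp_q):
    # Stand-in for _duckdb_worker(): block waiting for commands.
    while True:
        cmd_q.get()

class DuckDBWorker:
    """Owns the DuckDB subprocess; restarts it when a command times out."""

    def __init__(self, target=_idle_worker):
        self._target = target
        self.loaded_tables = {}      # table metadata held in the main process
        self.polaris_catalogs = {}   # attached catalogs (lost on respawn)
        self._spawn()

    def _spawn(self):
        # Fresh queues and a fresh subprocess with a clean DuckDB.
        self.cmd_q, self.resp_q = mp.Queue(), mp.Queue()
        self.proc = mp.Process(target=self._target,
                               args=(self.cmd_q, self.resp_q), daemon=True)
        self.proc.start()

    def _kill_and_respawn(self):
        # SIGKILL the hung subprocess, drop stale state, start a fresh worker.
        # The real code also wakes all pending waiters with a "worker restarted" error.
        self.proc.kill()
        self.proc.join()
        self.loaded_tables.clear()
        self.polaris_catalogs.clear()
        self._spawn()
```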

Memory Management

The _maybe_evict() function is called before loading any new table. It checks system memory usage via psutil and evicts the oldest tables from the loaded_tables OrderedDict (LRU order) until system memory drops below 85%.

# Memory threshold
MEMORY_THRESHOLD = 0.85  # evict at 85% system memory

DuckDB Initialization

import duckdb
import os

def init_duckdb():
    # Ensure HOME is set - systemd services may not have it
    if not os.environ.get("HOME"):
        os.environ["HOME"] = "/root"
    db_conn = duckdb.connect(":memory:")
    db_conn.execute(f"SET home_directory='{os.environ['HOME']}';")
    db_conn.execute("INSTALL httpfs; LOAD httpfs;")
    db_conn.execute("INSTALL iceberg; LOAD iceberg;")
    db_conn.execute("INSTALL azure; LOAD azure;")
    db_conn.execute("SET azure_transport_option_type='curl';")  # Fixes SSL on SUSE
    return db_conn

Without SET home_directory, DuckDB cannot install extensions when running under systemd (which strips HOME); this surfaces as "Can't find the home directory at ''" errors when connecting to Polaris catalogs.

Iceberg & Polaris Integration

Overview

The explorer uses the DuckDB Iceberg extension for querying Iceberg tables and the Polaris REST catalog for fast namespace and table listing. Three catalog presets are pre-configured for GCS, Azure, and AWS.

Catalog Presets

  • gcs: Google Cloud Storage; catalog obeisance_plaintive; auth OAuth2 client credentials + gcloud auth
  • azure: Azure Data Lake; catalog log_pseudo; auth OAuth2 client credentials + CREDENTIAL_CHAIN
  • aws: AWS S3; catalog surfacing_caramel; auth OAuth2 + Polaris vended S3 credentials

Polaris REST API (Fast Listing)

Namespace and table listing uses the Polaris REST API directly instead of DuckDB information_schema queries. This is approximately 250x faster (0.3s vs 70s for Azure).

# Key REST API functions
def _polaris_oauth_token(endpoint, client_id, client_secret):
    """POST to {endpoint}/v1/oauth/tokens with client_credentials grant."""

def _polaris_rest_get(endpoint, path, token):
    """GET {endpoint}{path} with Bearer token."""

def list_polaris_namespaces(alias):
    """GET /v1/{catalog}/namespaces"""

def list_polaris_tables(alias, namespace):
    """GET /v1/{catalog}/namespaces/{ns}/tables"""

DuckDB ATTACH is still used for actual SQL queries; only browsing and listing go through the REST API.
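A hedged sketch of the token call, assuming a standard client_credentials form POST per the docstring above (the scope value and exact response shape are assumptions):

```python
import json
import urllib.parse
import urllib.request

def build_token_request(endpoint: str, client_id: str, client_secret: str):
    # POST {endpoint}/v1/oauth/tokens with a client_credentials grant.
    url = endpoint.rstrip("/") + "/v1/oauth/tokens"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "PRINCIPAL_ROLE:ALL",  # assumed scope value
    }).encode()
    return url, body

def polaris_oauth_token(endpoint, client_id, client_secret) -> str:
    url, body = build_token_request(endpoint, client_id, client_secret)
    req = urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["access_token"]  # assumed response field
```

The returned bearer token is then sent on the namespace/table GET calls.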

AWS Vended Credentials & Query Rewrite

DuckDB's Iceberg extension does NOT send the X-Iceberg-Access-Delegation: vended-credentials header when querying via ATTACH ... TYPE ICEBERG. The workaround implemented in _rewrite_aws_query():

  1. Detect that the query references an AWS Polaris catalog (e.g., aws.namespace.table)
  2. Resolve the table's metadata-location via the Polaris REST API with vended-credentials header
  3. Fetch vended S3 credentials from the table response config field
  4. Create a DuckDB S3 secret with the vended credentials
  5. Rewrite the query to use iceberg_scan('s3://...metadata.json')

# User writes:
SELECT * FROM aws.sap_hana_db_saphanadb.mara LIMIT 100

# Server rewrites to:
SELECT * FROM iceberg_scan('s3://ts-s3-multiformat-dest-demo/sap_hana_db_saphanadb/mara/metadata/00092-xxx.metadata.json') LIMIT 100
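The final rewrite step can be sketched with a regular expression (simplified; metadata_location_for stands in for the Polaris REST lookup described in step 2, and the real _rewrite_aws_query also creates the S3 secret first):

```python
import re

# Matches aws.<namespace>.<table> references in the query text.
AWS_TABLE_RE = re.compile(r"\baws\.([A-Za-z0-9_]+)\.([A-Za-z0-9_]+)\b")

def rewrite_aws_query(sql: str, metadata_location_for) -> str:
    # Replace each aws.<namespace>.<table> reference with an iceberg_scan()
    # over the table's current metadata JSON (resolved via the Polaris REST API).
    def repl(m):
        ns, table = m.group(1), m.group(2)
        return f"iceberg_scan('{metadata_location_for(ns, table)}')"
    return AWS_TABLE_RE.sub(repl, sql)
```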

Credential Management

The server version includes a Manage Credentials UI panel that allows updating OAuth2 client_id and client_secret at runtime via POST /api/polaris/update_credentials. Changes persist in memory until server restart. If a provider is currently connected, it is automatically disconnected so the next connect uses the new credentials.
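The runtime update can be sketched as follows (all names except update_polaris_credentials are hypothetical simplifications; masking behavior is an assumption consistent with the "masked secrets" presets endpoint):

```python
# Hypothetical in-memory preset store and connected-catalog set.
CATALOG_PRESETS = {"aws": {"client_id": "old-id", "client_secret": "old-secret"}}
connected = {"aws"}

def mask_secret(value: str) -> str:
    # Never return full secrets in API responses; show only the last 4 chars.
    return "****" + value[-4:] if len(value) > 4 else "****"

def update_polaris_credentials(alias: str, client_id: str, client_secret: str) -> dict:
    # In-memory only: changes are lost on server restart. If the provider is
    # currently connected, disconnect it so the next connect uses the new creds.
    CATALOG_PRESETS[alias]["client_id"] = client_id
    CATALOG_PRESETS[alias]["client_secret"] = client_secret
    connected.discard(alias)
    return {"alias": alias, "client_id": client_id,
            "client_secret": mask_secret(client_secret)}
```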

API Endpoints

GET Endpoints

All routes are under the BASE_PATH prefix (/datalake_reader on server).

  • /: serves HTML_PAGE
  • /login: serves LOGIN_PAGE (server only)
  • /logout: clears the session cookie (server only)
  • /api/init (init_gcs()): initialize the GCS client, return status + last state
  • /api/buckets (list_buckets()): list all accessible GCS buckets
  • /api/ls?prefix=&bucket= (list_path()): list dirs/files under a GCS prefix with stats
  • /api/parquet?path= (read_parquet()): read a single parquet file, register it in DuckDB
  • /api/load_dir?prefix= (read_all_parquets_in_dir()): read all parquets in a data/ dir and combine them
  • /api/sql?query=&prefix= (run_sql()): execute SQL via DuckDB, auto-load missing tables
  • /api/cat?path= (read_file_text()): read a text file from GCS
  • /api/tables (get_loaded_tables()): list tables registered in DuckDB
  • /api/health: health check (no auth, no DuckDB)
  • /api/auth (run_gcloud_auth()): run gcloud auth application-default login
  • /api/azure_auth (run_azure_auth()): create a DuckDB Azure credential chain secret
  • /api/aws_auth?mode= (run_aws_auth()): create a DuckDB S3 secret (role or keys)
  • /api/polaris/connect (connect_polaris()): attach an Iceberg catalog to DuckDB
  • /api/polaris/disconnect (disconnect_polaris()): detach a catalog
  • /api/polaris/catalogs (get_connected_catalogs()): list connected catalogs (no secrets)
  • /api/polaris/presets (get_catalog_presets()): return presets with masked secrets
  • /api/polaris/namespaces?alias= (list_polaris_namespaces()): list namespaces via the REST API
  • /api/polaris/tables?alias=&namespace= (list_polaris_tables()): list tables via the REST API
  • /api/polaris/all_tables (list_all_polaris_tables()): all tables across all connected catalogs

POST Endpoints

  • /api/login: email login, sets the session cookie (server only)
  • /api/restart: restart the server via systemctl restart gcs-explorer
  • /api/polaris/update_credentials (update_polaris_credentials()): update preset credentials at runtime

Deployment

Production Deployment Workflow

# 1. Upload to server
scp /tmp/gcs_explorer_server.py sapidesecc8:/tmp/gcs_explorer_server.py

# 2. Deploy and restart
ssh sapidesecc8 "sudo cp /tmp/gcs_explorer_server.py /usr/sap/gcs_explorer_server.py; sudo systemctl restart gcs-explorer"

# 3. Verify
ssh sapidesecc8 "sudo systemctl status gcs-explorer --no-pager"

Dev Deployment

A separate dev instance runs alongside production on port 8443:

  • Port: production 443; dev 8443
  • URL prefix: production /datalake_reader/; dev /datalake_reader_dev/
  • Service: production gcs-explorer.service; dev gcs-explorer-dev.service
  • Server file: production /usr/sap/gcs_explorer_server.py; dev /usr/sap/gcs_explorer_server_dev.py
  • Log file: production /var/log/gcs_explorer.log; dev /var/log/gcs_explorer_dev.log

# Dev deployment
scp /tmp/gcs_explorer_server_dev.py sapidesecc8:/tmp/
ssh sapidesecc8 "sudo cp /tmp/gcs_explorer_server_dev.py /usr/sap/gcs_explorer_server_dev.py; sudo systemctl restart gcs-explorer-dev"

Service Management

sudo systemctl start gcs-explorer       # Start
sudo systemctl stop gcs-explorer        # Stop
sudo systemctl restart gcs-explorer     # Restart (after deploy)
sudo systemctl status gcs-explorer      # Check status
sudo systemctl enable gcs-explorer      # Enable on boot (already done)
journalctl -u gcs-explorer -f           # Follow logs

SSL Certificate

Certificate
/usr/sap/gcs_explorer_cert.pem (ZeroSSL, chained: server + intermediate CA)
Private Key
/usr/sap/gcs_explorer_key.pem
Expiration
June 25, 2026

Server Stability & Resilience

Proactive measures (prevent freezes):

  • Subprocess isolation (separate GIL): DuckDB C code can never freeze the HTTP server
  • Per-command timeouts (65s ATTACH, 50s query): hung operations auto-kill the worker
  • Skip-if-already-connected in connect_polaris(): avoids redundant ATTACH calls that caused contention
  • Auto-respawn on worker death: fresh DuckDB after any crash, no manual restart
  • /api/health endpoint (no auth, no DuckDB): always responds even when the worker is hung

Reactive backup measures:

  • External systemd watchdog timer (every 60s): curls /api/health, restarts the service if unresponsive
  • systemd Restart=always: catches process-level crashes
  • BrokenPipeError handling: tolerates client disconnects mid-response
  • Socket backlog = 64: handles queued connections under load
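The watchdog's implementation is not shown in this document; one plausible realization is a systemd service/timer pair along these lines (unit names, paths, and the curl flags are assumptions):

```ini
# /etc/systemd/system/gcs-explorer-watchdog.service (assumed name)
[Unit]
Description=Restart GCS Parquet Explorer if /api/health stops responding

[Service]
Type=oneshot
# -k: the cert is issued for the public hostname, not localhost
ExecStart=/bin/sh -c 'curl -ksf --max-time 10 https://localhost/datalake_reader/api/health || systemctl restart gcs-explorer'
```

```ini
# /etc/systemd/system/gcs-explorer-watchdog.timer (assumed name)
[Unit]
Description=Run the GCS explorer health check every 60s

[Timer]
OnBootSec=60
OnUnitActiveSec=60

[Install]
WantedBy=timers.target
```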

Troubleshooting

  • Port already in use
    Cause: a previous process did not release port 443.
    Solution: sudo lsof -i :443 to find the PID, then kill -9 <PID>. SO_REUSEADDR is set but does not help while the old process is still alive.
  • DuckDB freezes / hangs
    Cause: the C extension holds the GIL during network I/O.
    Solution: the subprocess architecture auto-recovers; the worker is killed and respawned on timeout. If the entire server freezes, the systemd watchdog restarts it within 60s.
  • SSL certificate errors
    Cause: expired or missing ZeroSSL cert.
    Solution: check expiry with openssl x509 -enddate -noout -in /usr/sap/gcs_explorer_cert.pem. Renew via the ZeroSSL dashboard, replace the PEM files, restart the service.
  • GCS credentials error
    Cause: missing or expired Application Default Credentials.
    Solution: re-run gcloud auth application-default login (local) or check the service account key on the server.
  • Memory pressure / OOM
    Cause: too many large Parquet tables loaded.
    Solution: LRU eviction triggers automatically at 85% memory. Manually unload tables via the UI or restart the service.
  • SUSE Python path issues
    Cause: sudo uses /usr/bin/python3, which lacks duckdb/pyarrow.
    Solution: always use /root/miniconda/bin/python3 in service files and direct execution; never rely on the system Python.
  • Azure SSL errors on SUSE
    Cause: the default SSL transport fails on SUSE Linux.
    Solution: set azure_transport_option_type='curl' during DuckDB init. The server auto-detects CA bundle paths.
  • DuckDB extension install fails under systemd
    Cause: missing HOME environment variable.
    Solution: set os.environ["HOME"] = "/root" and SET home_directory in DuckDB before installing extensions.
  • AWS queries hang forever
    Cause: the DuckDB Iceberg extension does not send the vended-credentials header.
    Solution: the _rewrite_aws_query() workaround resolves metadata via the REST API and rewrites queries to use iceberg_scan().
  • Worker respawn loses state
    Cause: all DuckDB secrets, catalogs, and tables are lost on respawn.
    Solution: expected behavior; reconnect catalogs and reload tables after a worker restart.