GCS Parquet Explorer

Datalake Reader & Iceberg Table Explorer
SAP S/4HANA 2023

Overview

GCS Parquet Explorer is a single-file Python web application for browsing and querying SAP CDS Parquet data across Google Cloud Storage, Azure Data Lake, and AWS S3. It provides an interactive web UI with DuckDB-powered SQL queries, schema inspection, and Iceberg table support via Polaris REST catalogs.

No pandas dependency: the entire stack runs on DuckDB + PyArrow for maximum performance and a minimal memory footprint. In the server version, DuckDB runs in a separate, GIL-isolated subprocess so its C code cannot block the HTTP server.

Application

Type
Single-file Python web app
GitHub
fivetran-antoniocarbone/gcs-parquet-explorer
Live URL
sapidesecc8/datalake_reader/
SQL Engine
DuckDB (subprocess-isolated)
Parquet
PyArrow (no pandas)

Two Versions

Local
gcs_explorer.py — port 8765, HTTP
Server
gcs_explorer_server.py — port 443, HTTPS
Cloud Support
GCS, Azure Data Lake, AWS S3
Iceberg
DuckDB extension + Polaris REST
Default Bucket
sap_cds_dbt

Key Features

  • Browse GCS buckets and Parquet files
  • SQL queries via DuckDB
  • Schema inspection and Parquet metadata
  • Multi-cloud support (GCS, Azure, S3)
  • Iceberg table support via Polaris catalogs
  • Memory-aware caching with LRU eviction
  • Responsive Fivetran-themed UI
  • AWS vended credentials with query rewrite

Architecture

Single-File Design

Everything lives in one .py file: Python backend + embedded HTML/CSS/JS frontend in a raw string (HTML_PAGE = r"""..."""). Because the HTML is in a Python raw string, \u unicode escapes do NOT work in CSS/JS — use literal UTF-8 characters or HTML entities.
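The raw-string behavior can be illustrated from the Python side (a minimal sketch; the HTML_PAGE content below is a stand-in, not the real page):

```python
# In a raw string, Python does not process backslash escapes, so a "\u2713"
# written inside HTML_PAGE stays as six literal characters instead of one
# interpreted character.
HTML_PAGE = r"""<script>var check = "\u2713";</script>"""

raw_len = len(r"\u2713")    # 6 literal chars: backslash, u, 2, 7, 1, 3
cooked_len = len("\u2713")  # 1 char: the interpreted escape
```

This is why the embedded frontend uses literal UTF-8 characters or HTML entities instead of escape sequences.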

Technology Stack

Server: http.server.ThreadingHTTPServer (multi-threaded)
DuckDB isolation: separate subprocess via multiprocessing.Process (GIL-isolated)
SQL engine: DuckDB (in-memory, with httpfs + iceberg + azure extensions)
Azure transport: SET azure_transport_option_type='curl' (fixes SSL on SUSE)
Parquet: PyArrow (pyarrow.parquet), no pandas dependency
GCS: google-cloud-storage with Application Default Credentials
Iceberg: DuckDB Iceberg extension + Polaris REST catalog
Polaris REST API: direct HTTP calls for fast namespace/table listing (~250x faster)
Memory: psutil for LRU eviction at 85% system memory
Auth: email-based login with session cookies (server version only)
SSL: ZeroSSL certificate with chained PEM (server version only)
UI: embedded HTML/CSS/JS, Fivetran light theme, no frameworks

Server vs Local Differences

  • Port: local 8765 (HTTP); server 443 (HTTPS)
  • URL prefix: local /; server /datalake_reader/
  • Login: local none; server email-based (LOGIN_PASSWORD)
  • SSL: local none; server ZeroSSL cert + key
  • Threading: local HTTPServer; server ThreadingHTTPServer
  • DuckDB process: local in-process; server separate subprocess (DuckDBWorker)
  • DuckDB comms: local direct calls; server multiprocessing.Queue (command/response)
  • Hang recovery: local none; server auto-kill + respawn worker on timeout
  • Socket backlog: local default (5); server 64 (request_queue_size = 64)
  • Azure transport: local default; server curl (fixes SSL on SUSE)
  • Polaris listing: local DuckDB information_schema; server direct Polaris REST API
  • Credential mgmt: local none; server UI panel to update client_id/secret
  • Process mgmt: local manual; server systemd (auto-restart, boot start)

Local Version

Configuration

File
gcs_explorer.py
Port
8765 (HTTP)
Auth
None
Default Bucket
sap_cds_dbt
Default Prefix
sap_cds_views/
State File
~/.gcs_explorer_state.json
DuckDB
In-process (direct calls)
Python
3.10+ required

Requirements

duckdb
pyarrow
google-cloud-storage
psutil

Setup

# Install dependencies
pip install duckdb pyarrow google-cloud-storage psutil

# Authenticate with GCS
gcloud auth application-default login

# Run the explorer
python gcs_explorer.py

The local version stores the last-browsed bucket and prefix in ~/.gcs_explorer_state.json so it resumes where you left off on the next launch.
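The persistence can be sketched as follows (save_state and load_state are hypothetical names; only the file path and the resume behavior come from the text):

```python
import json
import os

STATE_FILE = os.path.expanduser("~/.gcs_explorer_state.json")

def save_state(bucket: str, prefix: str, path: str = STATE_FILE) -> None:
    # Persist the last-browsed location so the next launch can resume there.
    with open(path, "w") as f:
        json.dump({"bucket": bucket, "prefix": prefix}, f)

def load_state(path: str = STATE_FILE) -> dict:
    # A missing or corrupt state file falls back to the defaults.
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        return {"bucket": "sap_cds_dbt", "prefix": "sap_cds_views/"}
```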

Server Version (sapidesecc8)

Host Details

Hostname
sapidesecc8.fivetran-internal-sales.com
OS
SUSE Linux Enterprise Server 15 SP5
Python
/root/miniconda/bin/python3 (3.13.9)
Port
443 (HTTPS)
URL
sapidesecc8/datalake_reader/
SSL
ZeroSSL (expires Jun 25, 2026)
Login
(see vault)
SSH
root@sapidesecc8

Important: sudo on sapidesecc8 uses /usr/bin/python3, which does NOT have duckdb/pyarrow. Always use /root/miniconda/bin/python3 in service files and direct execution.

Server Files

/usr/sap/gcs_explorer_server.py: deployed production copy (port 443)
/usr/sap/gcs_explorer_server_dev.py: deployed dev copy (port 8443)
/usr/sap/gcs_explorer_cert.pem: SSL certificate (ZeroSSL, chained: server + intermediate CA)
/usr/sap/gcs_explorer_key.pem: SSL private key
/usr/sap/gcs_explorer.env: environment variables (secrets)
/etc/systemd/system/gcs-explorer.service: systemd service unit (production)
/etc/systemd/system/gcs-explorer-dev.service: systemd service unit (dev)
/var/log/gcs_explorer.log: server log file (production)
/var/log/gcs_explorer_dev.log: server log file (dev)

systemd Service

[Unit]
Description=GCS Parquet Explorer Web Server
After=network.target

[Service]
Type=simple
WorkingDirectory=/usr/sap
EnvironmentFile=-/usr/sap/gcs_explorer.env
ExecStart=/root/miniconda/bin/python3 -u /usr/sap/gcs_explorer_server.py
Restart=always
RestartSec=5
StandardOutput=append:/var/log/gcs_explorer.log
StandardError=append:/var/log/gcs_explorer.log

[Install]
WantedBy=multi-user.target

Features: auto-restart on crash (5s delay), auto-start on reboot, unbuffered output (-u) for immediate log visibility.

Additional Routes

The server also hosts the SAP Skills Portal static files at the /sap_skills/ route, serving HTML/CSS/JS from /usr/sap/sap_skills/ on the filesystem.

Environment Variables

GCS_EXPLORER_PASSWORD: login password (default: changeme)
POLARIS_GCS_CLIENT_ID: Google Cloud Polaris OAuth client ID
POLARIS_GCS_CLIENT_SECRET: Google Cloud Polaris OAuth client secret
POLARIS_AZURE_CLIENT_ID: Azure Polaris OAuth client ID
POLARIS_AZURE_CLIENT_SECRET: Azure Polaris OAuth client secret
POLARIS_AWS_CLIENT_ID: AWS Polaris OAuth client ID
POLARIS_AWS_CLIENT_SECRET: AWS Polaris OAuth client secret

On the server, these are stored in /usr/sap/gcs_explorer.env and loaded by systemd via EnvironmentFile. The current production server still has credentials hardcoded in its deployed copy and works without the env file.

DuckDB Subprocess Architecture

Why Subprocess Isolation?

DuckDB's C extension holds the Python GIL (Global Interpreter Lock) during network I/O operations such as ATTACH TYPE ICEBERG and iceberg_scan over S3. In the old in-process design, this froze ALL Python threads — including the HTTP server's accept loop. No amount of Python-level locking or watchdogging could prevent this because the GIL itself was blocked in C code.

Solution: Moving DuckDB to a separate subprocess gives it its own GIL that cannot affect the HTTP server. The two processes communicate via multiprocessing.Queue using a command/response pattern.

Process Architecture

┌─ Main Process (HTTP Server) ─────┐      ┌─ Child Process (DuckDB) ───────┐
│ ThreadingHTTPServer              │      │ _duckdb_worker()               │
│ All HTTP handlers                │      │ Owns db_conn (duckdb.connect)  │
│ GCS operations (list, download)  │  Q   │ Owns arrow_tables dict         │
│ Polaris REST API calls           │ ───> │ Command loop:                  │
│ DuckDBWorker (sends commands)    │ <─── │   exec_sql, exec_multi,        │
│ loaded_tables (metadata only)    │  Q   │   register_table,              │
│ polaris_catalogs dict            │      │   unregister_table             │
└──────────────────────────────────┘      └────────────────────────────────┘

Commands

  • exec_sql: execute a single SQL query (timeout 50s)
  • exec_multi: execute multiple SQL statements, e.g. ATTACH (timeout 65s)
  • register_table: register a PyArrow table in DuckDB via IPC serialization (timeout 30s)
  • unregister_table: remove a table from DuckDB (timeout 15s)
  • init: initialize DuckDB with extensions
  • interrupt: cancel the current operation
  • shutdown: gracefully terminate the worker

Command Protocol

# Main -> Worker (via cmd_q)
{"id": "uuid", "cmd": "exec_sql", "args": {"query": "SELECT ..."}}

# Worker -> Main (via resp_q)
{"id": "uuid", "status": "ok", "result": {"cols": [...], "data": [[...]]}}

Hang Recovery

When a DuckDB operation hangs (e.g., iceberg_scan over unreachable S3):

  1. The send_command() timeout fires
  2. DuckDBWorker._kill_and_respawn() sends SIGKILL to the subprocess
  3. All pending waiters are woken with "worker restarted" error
  4. loaded_tables and polaris_catalogs are cleared
  5. A fresh subprocess starts with clean DuckDB
  6. The HTTP server never stops responding
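The recovery sequence above can be sketched like this (a simplified sketch; only _kill_and_respawn and the cleared dicts come from the text, everything else is assumed):

```python
import multiprocessing as mp

def _idle_worker(cmd_q, resp_q):
    # Stand-in for _duckdb_worker(): block waiting for commands.
    while True:
        cmd_q.get()

class DuckDBWorker:
    """Owns the DuckDB subprocess; restarts it when a command times out."""

    def __init__(self, target=_idle_worker):
        self._target = target
        self.loaded_tables = {}      # table metadata held in the main process
        self.polaris_catalogs = {}   # attached catalogs (lost on respawn)
        self._spawn()

    def _spawn(self):
        # Fresh queues and a fresh subprocess with a clean DuckDB.
        self.cmd_q, self.resp_q = mp.Queue(), mp.Queue()
        self.proc = mp.Process(target=self._target,
                               args=(self.cmd_q, self.resp_q), daemon=True)
        self.proc.start()

    def _kill_and_respawn(self):
        # SIGKILL the hung subprocess, drop stale state, start a fresh worker.
        # The real code also wakes all pending waiters with a "worker restarted" error.
        self.proc.kill()
        self.proc.join()
        self.loaded_tables.clear()
        self.polaris_catalogs.clear()
        self._spawn()
```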

Memory Management

The _maybe_evict() function is called before loading any new table. It checks system memory usage via psutil and evicts the oldest tables from the loaded_tables OrderedDict (LRU order) until system memory drops below 85%.

# Memory threshold
MEMORY_THRESHOLD = 0.85  # evict at 85% system memory

DuckDB Initialization

import duckdb
import os

def init_duckdb():
    # Ensure HOME is set - systemd services may not have it
    if not os.environ.get("HOME"):
        os.environ["HOME"] = "/root"
    db_conn = duckdb.connect(":memory:")
    db_conn.execute(f"SET home_directory='{os.environ['HOME']}';")
    db_conn.execute("INSTALL httpfs; LOAD httpfs;")
    db_conn.execute("INSTALL iceberg; LOAD iceberg;")
    db_conn.execute("INSTALL azure; LOAD azure;")
    db_conn.execute("SET azure_transport_option_type='curl';")  # Fixes SSL on SUSE
    return db_conn

Without SET home_directory, DuckDB cannot install extensions when running under systemd (which strips HOME); this surfaces as "Can't find the home directory at ''" errors when connecting to Polaris catalogs.

Iceberg & Polaris Integration

Overview

The explorer uses the DuckDB Iceberg extension for querying Iceberg tables and the Polaris REST catalog for fast namespace and table listing. Three catalog presets are pre-configured for GCS, Azure, and AWS.

Catalog Presets

  • gcs: Google Cloud Storage; catalog obeisance_plaintive; auth OAuth2 client credentials + gcloud auth
  • azure: Azure Data Lake; catalog log_pseudo; auth OAuth2 client credentials + CREDENTIAL_CHAIN
  • aws: AWS S3; catalog surfacing_caramel; auth OAuth2 + Polaris vended S3 credentials

Polaris REST API (Fast Listing)

Namespace and table listing uses the Polaris REST API directly instead of DuckDB information_schema queries. This is approximately 250x faster (0.3s vs 70s for Azure).

# Key REST API functions
def _polaris_oauth_token(endpoint, client_id, client_secret):
    """POST to {endpoint}/v1/oauth/tokens with client_credentials grant."""

def _polaris_rest_get(endpoint, path, token):
    """GET {endpoint}{path} with Bearer token."""

def list_polaris_namespaces(alias):
    """GET /v1/{catalog}/namespaces"""

def list_polaris_tables(alias, namespace):
    """GET /v1/{catalog}/namespaces/{ns}/tables"""

DuckDB ATTACH is still used for actual SQL queries; only browsing and listing go through the REST API.
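A hedged sketch of the token call, assuming a standard client_credentials form POST per the docstring above (the scope value and exact response shape are assumptions):

```python
import json
import urllib.parse
import urllib.request

def build_token_request(endpoint: str, client_id: str, client_secret: str):
    # POST {endpoint}/v1/oauth/tokens with a client_credentials grant.
    url = endpoint.rstrip("/") + "/v1/oauth/tokens"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "PRINCIPAL_ROLE:ALL",  # assumed scope value
    }).encode()
    return url, body

def polaris_oauth_token(endpoint, client_id, client_secret) -> str:
    url, body = build_token_request(endpoint, client_id, client_secret)
    req = urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["access_token"]  # assumed response field
```

The returned bearer token is then sent on the namespace/table GET calls.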

AWS Vended Credentials & Query Rewrite

DuckDB's Iceberg extension does NOT send the X-Iceberg-Access-Delegation: vended-credentials header when querying via ATTACH ... TYPE ICEBERG. The workaround implemented in _rewrite_aws_query():

  1. Detect that the query references an AWS Polaris catalog (e.g., aws.namespace.table)
  2. Resolve the table's metadata-location via the Polaris REST API with vended-credentials header
  3. Fetch vended S3 credentials from the table response config field
  4. Create a DuckDB S3 secret with the vended credentials
  5. Rewrite the query to use iceberg_scan('s3://...metadata.json')

# User writes:
SELECT * FROM aws.sap_hana_db_saphanadb.mara LIMIT 100

# Server rewrites to:
SELECT * FROM iceberg_scan('s3://ts-s3-multiformat-dest-demo/sap_hana_db_saphanadb/mara/metadata/00092-xxx.metadata.json') LIMIT 100
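The final rewrite step can be sketched with a regular expression (simplified; metadata_location_for stands in for the Polaris REST lookup described in step 2, and the real _rewrite_aws_query also creates the S3 secret first):

```python
import re

# Matches aws.<namespace>.<table> references in the query text.
AWS_TABLE_RE = re.compile(r"\baws\.([A-Za-z0-9_]+)\.([A-Za-z0-9_]+)\b")

def rewrite_aws_query(sql: str, metadata_location_for) -> str:
    # Replace each aws.<namespace>.<table> reference with an iceberg_scan()
    # over the table's current metadata JSON (resolved via the Polaris REST API).
    def repl(m):
        ns, table = m.group(1), m.group(2)
        return f"iceberg_scan('{metadata_location_for(ns, table)}')"
    return AWS_TABLE_RE.sub(repl, sql)
```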

Credential Management

The server version includes a Manage Credentials UI panel that allows updating OAuth2 client_id and client_secret at runtime via POST /api/polaris/update_credentials. Changes persist in memory until server restart. If a provider is currently connected, it is automatically disconnected so the next connect uses the new credentials.
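The runtime update can be sketched as follows (all names except update_polaris_credentials are hypothetical simplifications; masking behavior is an assumption consistent with the "masked secrets" presets endpoint):

```python
# Hypothetical in-memory preset store and connected-catalog set.
CATALOG_PRESETS = {"aws": {"client_id": "old-id", "client_secret": "old-secret"}}
connected = {"aws"}

def mask_secret(value: str) -> str:
    # Never return full secrets in API responses; show only the last 4 chars.
    return "****" + value[-4:] if len(value) > 4 else "****"

def update_polaris_credentials(alias: str, client_id: str, client_secret: str) -> dict:
    # In-memory only: changes are lost on server restart. If the provider is
    # currently connected, disconnect it so the next connect uses the new creds.
    CATALOG_PRESETS[alias]["client_id"] = client_id
    CATALOG_PRESETS[alias]["client_secret"] = client_secret
    connected.discard(alias)
    return {"alias": alias, "client_id": client_id,
            "client_secret": mask_secret(client_secret)}
```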

API Endpoints

GET Endpoints

All routes are under the BASE_PATH prefix (/datalake_reader on server).

  • /: serves HTML_PAGE
  • /login: serves LOGIN_PAGE (server only)
  • /logout: clears the session cookie (server only)
  • /api/init (init_gcs()): initialize the GCS client, return status + last state
  • /api/buckets (list_buckets()): list all accessible GCS buckets
  • /api/ls?prefix=&bucket= (list_path()): list dirs/files under a GCS prefix with stats
  • /api/parquet?path= (read_parquet()): read a single parquet file, register it in DuckDB
  • /api/load_dir?prefix= (read_all_parquets_in_dir()): read all parquets in a data/ dir and combine them
  • /api/sql?query=&prefix= (run_sql()): execute SQL via DuckDB, auto-load missing tables
  • /api/cat?path= (read_file_text()): read a text file from GCS
  • /api/tables (get_loaded_tables()): list tables registered in DuckDB
  • /api/health: health check (no auth, no DuckDB)
  • /api/auth (run_gcloud_auth()): run gcloud auth application-default login
  • /api/azure_auth (run_azure_auth()): create a DuckDB Azure credential chain secret
  • /api/aws_auth?mode= (run_aws_auth()): create a DuckDB S3 secret (role or keys)
  • /api/polaris/connect (connect_polaris()): attach an Iceberg catalog to DuckDB
  • /api/polaris/disconnect (disconnect_polaris()): detach a catalog
  • /api/polaris/catalogs (get_connected_catalogs()): list connected catalogs (no secrets)
  • /api/polaris/presets (get_catalog_presets()): return presets with masked secrets
  • /api/polaris/namespaces?alias= (list_polaris_namespaces()): list namespaces via the REST API
  • /api/polaris/tables?alias=&namespace= (list_polaris_tables()): list tables via the REST API
  • /api/polaris/all_tables (list_all_polaris_tables()): all tables across all connected catalogs

POST Endpoints

  • /api/login: email login, sets the session cookie (server only)
  • /api/restart: restart the server via systemctl restart gcs-explorer
  • /api/polaris/update_credentials (update_polaris_credentials()): update preset credentials at runtime

Deployment

Production Deployment Workflow

# 1. Upload to server
scp /tmp/gcs_explorer_server.py sapidesecc8:/tmp/gcs_explorer_server.py

# 2. Deploy and restart
ssh sapidesecc8 "sudo cp /tmp/gcs_explorer_server.py /usr/sap/gcs_explorer_server.py; sudo systemctl restart gcs-explorer"

# 3. Verify
ssh sapidesecc8 "sudo systemctl status gcs-explorer --no-pager"

Dev Deployment

A separate dev instance runs alongside production on port 8443:

  • Port: production 443; dev 8443
  • URL prefix: production /datalake_reader/; dev /datalake_reader_dev/
  • Service: production gcs-explorer.service; dev gcs-explorer-dev.service
  • Server file: production /usr/sap/gcs_explorer_server.py; dev /usr/sap/gcs_explorer_server_dev.py
  • Log file: production /var/log/gcs_explorer.log; dev /var/log/gcs_explorer_dev.log

# Dev deployment
scp /tmp/gcs_explorer_server_dev.py sapidesecc8:/tmp/
ssh sapidesecc8 "sudo cp /tmp/gcs_explorer_server_dev.py /usr/sap/gcs_explorer_server_dev.py; sudo systemctl restart gcs-explorer-dev"

Service Management

sudo systemctl start gcs-explorer       # Start
sudo systemctl stop gcs-explorer        # Stop
sudo systemctl restart gcs-explorer     # Restart (after deploy)
sudo systemctl status gcs-explorer      # Check status
sudo systemctl enable gcs-explorer      # Enable on boot (already done)
journalctl -u gcs-explorer -f           # Follow logs

SSL Certificate

Certificate
/usr/sap/gcs_explorer_cert.pem (ZeroSSL, chained: server + intermediate CA)
Private Key
/usr/sap/gcs_explorer_key.pem
Expiration
June 25, 2026

Server Stability & Resilience

Proactive measures (prevent freezes):

  • Subprocess isolation (separate GIL): DuckDB C code can never freeze the HTTP server
  • Per-command timeouts (65s ATTACH, 50s query): hung operations auto-kill the worker
  • Skip-if-already-connected in connect_polaris(): avoids redundant ATTACH calls that caused contention
  • Auto-respawn on worker death: fresh DuckDB after any crash, no manual restart
  • /api/health endpoint (no auth, no DuckDB): always responds even when the worker is hung

Reactive backup measures:

  • External systemd watchdog timer (every 60s): curls /api/health, restarts the service if unresponsive
  • systemd Restart=always: catches process-level crashes
  • BrokenPipeError handling: tolerates client disconnects mid-response
  • Socket backlog = 64: handles queued connections under load
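The watchdog's implementation is not shown in this document; one plausible realization is a systemd service/timer pair along these lines (unit names, paths, and the curl flags are assumptions):

```ini
# /etc/systemd/system/gcs-explorer-watchdog.service (assumed name)
[Unit]
Description=Restart GCS Parquet Explorer if /api/health stops responding

[Service]
Type=oneshot
# -k: the cert is issued for the public hostname, not localhost
ExecStart=/bin/sh -c 'curl -ksf --max-time 10 https://localhost/datalake_reader/api/health || systemctl restart gcs-explorer'
```

```ini
# /etc/systemd/system/gcs-explorer-watchdog.timer (assumed name)
[Unit]
Description=Run the GCS explorer health check every 60s

[Timer]
OnBootSec=60
OnUnitActiveSec=60

[Install]
WantedBy=timers.target
```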

Troubleshooting

  • Port already in use
    Cause: a previous process did not release port 443.
    Solution: sudo lsof -i :443 to find the PID, then kill -9 <PID>. SO_REUSEADDR is set but does not help while the old process is still alive.
  • DuckDB freezes / hangs
    Cause: the C extension holds the GIL during network I/O.
    Solution: the subprocess architecture auto-recovers; the worker is killed and respawned on timeout. If the entire server freezes, the systemd watchdog restarts it within 60s.
  • SSL certificate errors
    Cause: expired or missing ZeroSSL cert.
    Solution: check expiry with openssl x509 -enddate -noout -in /usr/sap/gcs_explorer_cert.pem. Renew via the ZeroSSL dashboard, replace the PEM files, restart the service.
  • GCS credentials error
    Cause: missing or expired Application Default Credentials.
    Solution: re-run gcloud auth application-default login (local) or check the service account key on the server.
  • Memory pressure / OOM
    Cause: too many large Parquet tables loaded.
    Solution: LRU eviction triggers automatically at 85% memory. Manually unload tables via the UI or restart the service.
  • SUSE Python path issues
    Cause: sudo uses /usr/bin/python3, which lacks duckdb/pyarrow.
    Solution: always use /root/miniconda/bin/python3 in service files and direct execution; never rely on the system Python.
  • Azure SSL errors on SUSE
    Cause: the default SSL transport fails on SUSE Linux.
    Solution: set azure_transport_option_type='curl' during DuckDB init. The server auto-detects CA bundle paths.
  • DuckDB extension install fails under systemd
    Cause: missing HOME environment variable.
    Solution: set os.environ["HOME"] = "/root" and SET home_directory in DuckDB before installing extensions.
  • AWS queries hang forever
    Cause: the DuckDB Iceberg extension does not send the vended-credentials header.
    Solution: the _rewrite_aws_query() workaround resolves metadata via the REST API and rewrites queries to use iceberg_scan().
  • Worker respawn loses state
    Cause: all DuckDB secrets, catalogs, and tables are lost on respawn.
    Solution: expected behavior; reconnect catalogs and reload tables after a worker restart.