Surviving your first six months on a data team

A short field guide to surviving your first six months on a data team — the things that took me longer than they should have. Skim the headings, jump to whichever section is on fire today.

The pipeline you inherit

Every pipeline you inherit was written under deadline pressure by someone who has since left the team. Read the code; don't trust the wiki. The wiki is a snapshot of intent, not behavior.

Three things are almost always wrong:

  1. The schedule (Airflow cron lies about timezones — see the sketch after this list)
  2. The retries (a retry that swallows its error turns a failure into a silent success)
  3. The owner (the listed owner left two reorgs ago)
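
On the timezone point: Airflow schedules in UTC unless the DAG pins a timezone explicitly. A minimal sketch (Airflow 2.4+; the DAG name and cron here are made up):

import pendulum
from airflow import DAG

# Pin the timezone so "30 6 * * *" means 06:30 IST, not 06:30 UTC.
# Airflow converts to UTC internally; an aware start_date makes that explicit.
dag = DAG(
    dag_id="events_daily",
    schedule="30 6 * * *",
    start_date=pendulum.datetime(2024, 1, 1, tz="Asia/Kolkata"),
    catchup=False,
)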

A pipeline you can't reason about in 10 minutes is a pipeline that owns you, not the other way around.

What I check first

The same three, in order: when the last successful run actually happened (in UTC, not the scheduler's pretty-printed local time); what a retry does with the error it swallows; and who gets paged when it breaks today. If you can answer all three inside ten minutes, the pipeline is yours. If not, see above.

Code that ships vs. code that runs

Most pipeline bugs aren't logic errors. They're type coercions in places nobody expected. Below: the same idea in three languages, ranked by how easy it is to get wrong.

Bash — easy to get very wrong

# Exit on the first error, unset variable, or failed pipeline stage.
set -euo pipefail

# First argument, or today's UTC date (YYYY-MM-DD) if none is given.
INPUT_DATE="${1:-$(date -u +%F)}"
BQ_PROJECT="thaneesh-prod"

# ${INPUT_DATE//-/} strips the dashes for the table suffix: events_20240101.
bq query \
  --use_legacy_sql=false \
  --destination_table "${BQ_PROJECT}:staging.events_${INPUT_DATE//-/}" \
  "SELECT * FROM \`${BQ_PROJECT}.raw.events\`
   WHERE _PARTITIONTIME = TIMESTAMP('${INPUT_DATE}')"

The set -euo pipefail line is the most important line in any bash file you'll ever ship: -e exits on the first failing command, -u makes unset variables an error, and -o pipefail fails a pipeline when any stage in it fails. Skip it and silent failures will eat your weekend.

Python — the safe default for most pipelines

from datetime import date, timedelta
from google.cloud import bigquery

client = bigquery.Client(project="thaneesh-prod")

def backfill(target: date, lookback_days: int = 7) -> int:
    """Re-materialize the last N days. Returns rows written."""
    start = target - timedelta(days=lookback_days)
    job = client.query(
        """
        SELECT * FROM `thaneesh-prod.raw.events`
        WHERE _PARTITIONTIME BETWEEN TIMESTAMP(@start) AND TIMESTAMP(@target)
        """,
        job_config=bigquery.QueryJobConfig(
            # Named parameters instead of f-string interpolation: same query,
            # but nothing is ever spliced into the SQL as text.
            query_parameters=[
                bigquery.ScalarQueryParameter("start", "DATE", start),
                bigquery.ScalarQueryParameter("target", "DATE", target),
            ],
            destination=f"thaneesh-prod.staging.events_{target:%Y%m%d}",
            write_disposition="WRITE_TRUNCATE",
        ),
    )
    result = job.result()
    return result.total_rows or 0
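
A typical invocation — re-materialize everything up through yesterday:

from datetime import date, timedelta

rows = backfill(date.today() - timedelta(days=1))
print(f"backfill wrote {rows} rows")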

TypeScript — when the orchestrator lives in Node

import { BigQuery } from "@google-cloud/bigquery";

const bq = new BigQuery({ projectId: "thaneesh-prod" });

// YYYYMMDD table suffix, always derived from the UTC date.
const ymd = (d: Date) => d.toISOString().slice(0, 10).replace(/-/g, "");

export async function backfill(target: Date, lookbackDays = 7): Promise<number> {
  const start = new Date(target);
  start.setUTCDate(start.getUTCDate() - lookbackDays);

  const [job] = await bq.createQueryJob({
    query: `
      SELECT * FROM \`thaneesh-prod.raw.events\`
      WHERE _PARTITIONTIME BETWEEN
        TIMESTAMP(@start) AND TIMESTAMP(@target)
    `,
    params: { start: start.toISOString(), target: target.toISOString() },
    destination: bq.dataset("staging").table(`events_${ymd(target)}`),
    writeDisposition: "WRITE_TRUNCATE",
  });

  // Fine for modest results; a big backfill should read the count from job
  // metadata rather than buffering every row in memory.
  const [rows] = await job.getQueryResults();
  return rows.length;
}

SQL — what actually runs on the warehouse

MERGE `thaneesh-prod.warehouse.user_sessions` AS target
USING (
  SELECT
    user_id,
    session_id,
    MIN(event_ts) AS started_at,
    MAX(event_ts) AS ended_at,
    COUNT(*) AS event_count
  FROM `thaneesh-prod.raw.events`
  WHERE _PARTITIONTIME = TIMESTAMP(@target)
  GROUP BY user_id, session_id
) AS source
ON target.user_id = source.user_id AND target.session_id = source.session_id
WHEN MATCHED THEN UPDATE SET
  -- GREATEST keeps a late-arriving earlier partition from rewinding the session.
  ended_at = GREATEST(target.ended_at, source.ended_at),
  -- Caution: this increment double-counts if the same partition is merged twice;
  -- a re-run has to back out the day's contribution first.
  event_count = target.event_count + source.event_count
WHEN NOT MATCHED THEN INSERT (user_id, session_id, started_at, ended_at, event_count)
  VALUES (source.user_id, source.session_id, source.started_at, source.ended_at, source.event_count);
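
To bind @target, pass it as a named query parameter. A sketch with the Python client — MERGE_SQL is assumed to hold the statement above:

from datetime import date

from google.cloud import bigquery

client = bigquery.Client(project="thaneesh-prod")

def run_merge(target: date) -> int:
    """Run the sessions MERGE for one day; returns rows affected."""
    job = client.query(
        MERGE_SQL,  # the MERGE statement above, stored as a string
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("target", "DATE", target)
            ]
        ),
    )
    job.result()  # block until the DML finishes
    return job.num_dml_affected_rows or 0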

A reference table

Stack                        Cost profile       Cold-start        Best for
BigQuery scheduled queries   Per-byte scanned   None              < 10 min jobs, ad-hoc reports
Cloud Composer (Airflow)     Always-on VMs      Slow (~minutes)   Complex DAGs, long-running
Cloud Run + Pub/Sub          Per request        Fast (< 1 s)      Event-driven, bursty
Dataflow streaming           Per-vCPU-hour      Slow              Continuous, > 100 events/sec

Pick the cheapest one that fits the SLA. The "best practice" answer is often the most expensive — be skeptical.


Tiny configs that bite

A ~/.config/gcloud/configurations/config_default that quietly points at the wrong project will cost you an afternoon:

[core]
account = [email protected]
# the next line is the one that bites
project = thaneesh-prod
disable_usage_reporting = True

[compute]
region = asia-south1
zone   = asia-south1-a

Run gcloud config list every morning before doing anything destructive. To override for a single command, pass --project=... instead of trusting the ambient config.
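
The same check, done in code before a destructive job — a sketch assuming Application Default Credentials (EXPECTED is a made-up constant):

import google.auth

EXPECTED = "thaneesh-prod"  # hypothetical: whatever you intend to touch

# google.auth.default() resolves ADC plus the ambient project —
# the same chain the client libraries use.
credentials, project = google.auth.default()
if project != EXPECTED:
    raise SystemExit(f"refusing to run: ambient project {project!r}, expected {EXPECTED!r}")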


What I wish someone had told me

  1. Pipelines that have "just worked for years" are the most dangerous — they've never been tested under failure.
  2. Cost dashboards lie. The bill arrives 30 days late, so set alerts at 1.5× and 2× of last month's spend rather than at absolute thresholds.
  3. The first question for any data quality issue should always be "is this fresh?", not "is this correct?" — the freshness check is one query, sketched below.
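
A sketch of that freshness check, against the events table from earlier:

from google.cloud import bigquery

client = bigquery.Client(project="thaneesh-prod")

# Newest partition that has landed. If this lags, stop debugging "correctness".
query = """
    SELECT MAX(DATE(_PARTITIONTIME)) AS newest_partition
    FROM `thaneesh-prod.raw.events`
"""
row = next(iter(client.query(query).result()))
print(f"newest partition: {row.newest_partition}")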

Inline gotchas to remember

The _PARTITIONTIME pseudo-column is a TIMESTAMP (midnight UTC for the partition's day), not a DATE — and it is NULL for rows still in the streaming buffer. A predicate that evaluates to NULL is treated as false in a WHERE clause, so those rows silently vanish from your results with no error. Use DATE(_PARTITIONTIME) = CURRENT_DATE() (or compare against an explicit TIMESTAMP) and decide deliberately what should happen to the NULL partition.


If a pipeline can't be paused, rewound, and re-run from any point safely, it's not finished — it's just running. The rest of the job is making it boring.