<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
  <title>Thaneesh Reddy — Blog</title>
  <link>https://blog.thaneesh.in/</link>
  <atom:link href="https://blog.thaneesh.in/feed.xml" rel="self" type="application/rss+xml" />
  <description>Notes and write-ups from Thaneesh Reddy on data engineering, GCP, and AWS.</description>
  <language>en-us</language>
  <lastBuildDate>Fri, 15 May 2026 17:23:59 GMT</lastBuildDate>
  <item>
    <title>test post</title>
    <link>https://blog.thaneesh.in/posts/test-post</link>
    <guid isPermaLink="true">https://blog.thaneesh.in/posts/test-post</guid>
    <pubDate>Fri, 15 May 2026 17:23:59 GMT</pubDate>
    <description>A short field guide to surviving your first six months on a data team — the things that took me longer than they should have. Skim the headings, jump to whichever section is on fire today.</description>
    <content:encoded><![CDATA[<p>A short field guide to surviving your first six months on a data team — the things that took me longer than they should have. Skim the headings, jump to whichever section is on fire today.</p>
<h2>The pipeline you inherit</h2>
<p>Every pipeline you inherit was written <strong>under deadline pressure</strong> by someone who has since left the team. Read the code; don&#39;t trust the wiki. The wiki is a <a href="https://en.wikipedia.org/wiki/Documentation">snapshot of intent</a>, not behavior.</p>
<p>Three things are almost always wrong:</p>
<ol>
<li>The schedule (Airflow cron lies about timezones)</li>
<li>The retries (silent failures aren&#39;t failures)</li>
<li>The owner (the listed owner left two reorgs ago)</li>
</ol>
<blockquote>
<p>A pipeline you can&#39;t reason about in 10 minutes is a pipeline that owns <em>you</em>, not the other way around.</p>
</blockquote>
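<p>To make the schedule item concrete, here is a minimal sketch that converts a daily run time from UTC to local wall-clock time. It assumes the cron expression is evaluated in UTC (Airflow&#39;s default) and that the team sits in <code>Asia/Kolkata</code>; the function name <code>local_run_time</code> and the sample date are illustrative, not part of any pipeline described here.</p>
<pre><code class="language-python">from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_run_time(utc_hour, utc_minute, tz_name):
    """Return the local wall-clock time of a daily job scheduled in UTC."""
    # The calendar date is arbitrary here: Asia/Kolkata has no daylight saving.
    scheduled = datetime(2026, 5, 15, utc_hour, utc_minute, tzinfo=timezone.utc)
    return scheduled.astimezone(ZoneInfo(tz_name)).strftime("%H:%M")


# A cron of "0 2 * * *" evaluated in UTC does not run at 2 AM in India.
print(local_run_time(2, 0, "Asia/Kolkata"))  # 07:30
</code></pre>
<p>For zones that do observe daylight saving, the answer depends on the date, which is one more reason to check the DAG&#39;s timezone rather than the wiki.</p>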
<h3>What I check first</h3>
<ul>
<li><input checked="" disabled="" type="checkbox"> Source freshness — is the upstream actually publishing?</li>
<li><input checked="" disabled="" type="checkbox"> Watermark / cursor — where did the last successful run stop?</li>
<li><input disabled="" type="checkbox"> DAG run history — look for the <em>pattern</em> of failures, not the latest one</li>
<li><input disabled="" type="checkbox"> Cost — a pipeline that doubles in cost overnight is leaking something</li>
</ul>
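<p>The watermark and run-history checks can be sketched in a few lines. This is a rough illustration with made-up timestamps; <code>is_stale</code> and <code>failures_by_hour</code> are hypothetical helpers, and in practice the inputs would come from your orchestrator&#39;s metadata.</p>
<pre><code class="language-python">from collections import Counter
from datetime import datetime, timedelta, timezone


def is_stale(last_success, now, max_lag):
    """True when the last successful run is older than the allowed lag."""
    return now - last_success > max_lag


def failures_by_hour(failed_runs):
    """Count failed runs per UTC hour to expose a recurring pattern."""
    return Counter(run.astimezone(timezone.utc).hour for run in failed_runs)


# Hypothetical sample data, not taken from a real pipeline.
now = datetime(2026, 5, 15, 12, 0, tzinfo=timezone.utc)
watermark = datetime(2026, 5, 14, 3, 0, tzinfo=timezone.utc)
print(is_stale(watermark, now, timedelta(hours=24)))  # True

failed = [datetime(2026, 5, d, 2, 15, tzinfo=timezone.utc) for d in (11, 12, 14)]
print(failures_by_hour(failed).most_common(1))  # [(2, 3)]
</code></pre>
<p>Three failures in the same hour point at a schedule collision or an upstream that publishes late, which the latest failure alone would not show.</p>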
<hr>
<h2>Code that ships vs. code that runs</h2>
<p>Most pipeline bugs aren&#39;t logic errors. They&#39;re <strong>type coercions</strong> in places nobody expected. Below: the same idea in Bash, Python, and TypeScript, starting with the one that is easiest to get wrong, then the SQL that actually runs on the warehouse.</p>
<h3>Bash — easy to get very wrong</h3>
<pre><code class="language-bash"># Fail on the first error, on unset variables, and on failures inside pipes.
set -euo pipefail

INPUT_DATE=&quot;${1:-$(date -u +%F)}&quot;
BQ_PROJECT=&quot;thaneesh-prod&quot;

bq query \
  --use_legacy_sql=false \
  --replace \
  --destination_table &quot;${BQ_PROJECT}:staging.events_${INPUT_DATE//-/}&quot; \
  &quot;SELECT * FROM \`${BQ_PROJECT}.raw.events\`
   WHERE _PARTITIONTIME = TIMESTAMP(&#39;${INPUT_DATE}&#39;)&quot;
</code></pre>
<p>The <code>set -euo pipefail</code> line is <strong>the most important line in any bash file you&#39;ll ever ship.</strong> Skip it and silent failures will eat your weekend.</p>
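<p>When the shell step is driven from Python instead, the closest analogue to <code>set -e</code> is <code>check=True</code> on <code>subprocess.run</code>. The sketch below uses a hypothetical <code>run_step</code> wrapper and throwaway child processes in place of real pipeline commands.</p>
<pre><code class="language-python">import subprocess
import sys


def run_step(args):
    """Run one pipeline step and raise if it exits non-zero."""
    # check=True turns a non-zero exit code into CalledProcessError.
    return subprocess.run(args, check=True, capture_output=True, text=True)


ok = run_step([sys.executable, "-c", "print('loaded')"])
print(ok.stdout.strip())  # loaded

try:
    run_step([sys.executable, "-c", "import sys; sys.exit(3)"])
except subprocess.CalledProcessError as err:
    print(f"step failed with exit code {err.returncode}")
</code></pre>
<p>Without <code>check=True</code> the second call returns normally and the failure only shows up if someone remembers to inspect <code>returncode</code>.</p>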
<h3>Python — the safe default for most pipelines</h3>
<pre><code class="language-python">from datetime import date, timedelta
from google.cloud import bigquery

client = bigquery.Client(project=&quot;thaneesh-prod&quot;)

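def backfill(target: date, lookback_days: int = 7) -&gt; int:
    &quot;&quot;&quot;Re-materialize target plus the lookback_days before it. Returns rows written.&quot;&quot;&quot;
    # BETWEEN is inclusive on both ends, so this covers lookback_days + 1 partitions.
    start = target - timedelta(days=lookback_days)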
    job = client.query(
        f&quot;&quot;&quot;
        SELECT * FROM `thaneesh-prod.raw.events`
        WHERE _PARTITIONTIME BETWEEN
          TIMESTAMP(&#39;{start.isoformat()}&#39;) AND TIMESTAMP(&#39;{target.isoformat()}&#39;)
        &quot;&quot;&quot;,
        job_config=bigquery.QueryJobConfig(
            destination=f&quot;thaneesh-prod.staging.events_{target:%Y%m%d}&quot;,
            write_disposition=&quot;WRITE_TRUNCATE&quot;,
        ),
    )
    result = job.result()
    return result.total_rows or 0
</code></pre>
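<p>One detail worth checking in the function above: <code>BETWEEN</code> is inclusive on both ends, so a seven-day lookback touches eight daily partitions. A small sketch that enumerates them, using a hypothetical <code>partition_window</code> helper that needs no BigQuery access:</p>
<pre><code class="language-python">from datetime import date, timedelta


def partition_window(target, lookback_days=7):
    """List every daily partition an inclusive BETWEEN start AND target touches."""
    start = target - timedelta(days=lookback_days)
    return [start + timedelta(days=i) for i in range(lookback_days + 1)]


window = partition_window(date(2026, 5, 15))
print(len(window))            # 8
print(window[0], window[-1])  # 2026-05-08 2026-05-15
</code></pre>
<p>Whether eight is right depends on what &quot;last N days&quot; means to the consumer; the point is to decide it rather than inherit it.</p>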
<h3>TypeScript — when the orchestrator lives in Node</h3>
<pre><code class="language-ts">import { BigQuery } from &quot;@google-cloud/bigquery&quot;;

const bq = new BigQuery({ projectId: &quot;thaneesh-prod&quot; });

export async function backfill(target: Date, lookbackDays = 7): Promise&lt;number&gt; {
  // _PARTITIONTIME is midnight UTC, so compare whole UTC days, not target&#39;s time of day.
  const start = new Date(target);
  start.setUTCDate(start.getUTCDate() - lookbackDays);
  const day = (d: Date) =&gt; d.toISOString().slice(0, 10);

  const [job] = await bq.createQueryJob({
    query: `
      SELECT * FROM \`thaneesh-prod.raw.events\`
      WHERE _PARTITIONTIME BETWEEN
        TIMESTAMP(@start) AND TIMESTAMP(@target)
    `,
    params: { start: day(start), target: day(target) },
    destination: bq.dataset(&quot;staging&quot;).table(`events_${ymd(target)}`),
    writeDisposition: &quot;WRITE_TRUNCATE&quot;,
  });

  // Note: this pulls every result row back to the client just to count them.
  const [rows] = await job.getQueryResults();
  return rows.length;
}

const ymd = (d: Date) =&gt; d.toISOString().slice(0, 10).replace(/-/g, &quot;&quot;);
</code></pre>
<h3>SQL — what actually runs on the warehouse</h3>
<pre><code class="language-sql">MERGE `thaneesh-prod.warehouse.user_sessions` AS target
USING (
  SELECT
    user_id,
    session_id,
    MIN(event_ts) AS started_at,
    MAX(event_ts) AS ended_at,
    COUNT(*) AS event_count
  FROM `thaneesh-prod.raw.events`
  WHERE _PARTITIONTIME = TIMESTAMP(@target)
  GROUP BY user_id, session_id
) AS source
ON target.user_id = source.user_id AND target.session_id = source.session_id
WHEN MATCHED THEN UPDATE SET
  ended_at = source.ended_at,
  -- Caution: additive, so re-running the same @target counts that partition twice.
  event_count = target.event_count + source.event_count
WHEN NOT MATCHED THEN INSERT (user_id, session_id, started_at, ended_at, event_count)
  VALUES (source.user_id, source.session_id, source.started_at, source.ended_at, source.event_count);
</code></pre>
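<p>The <code>event_count</code> update in the <code>MERGE</code> above adds to the existing value, which matters on a re-run. The sketch below simulates that with plain dictionaries; <code>merge_additive</code> and <code>merge_overwrite</code> are hypothetical stand-ins for the two update styles, and the batch is invented.</p>
<pre><code class="language-python">def merge_additive(target, source):
    """Mimic event_count = target.event_count + source.event_count."""
    merged = dict(target)
    for key, count in source.items():
        merged[key] = merged.get(key, 0) + count
    return merged


def merge_overwrite(target, source):
    """Mimic event_count = source.event_count (last write wins)."""
    return {**target, **source}


# Hypothetical batch: (user_id, session_id) -> event_count for one partition.
batch = {("u1", "s1"): 5}

once = merge_additive({}, batch)
twice = merge_additive(once, batch)
print(once[("u1", "s1")], twice[("u1", "s1")])  # 5 10

print(merge_overwrite(merge_overwrite({}, batch), batch))  # {('u1', 's1'): 5}
</code></pre>
<p>Overwriting is idempotent but drops counts from sessions that span two partitions, so neither version is free; the choice should be deliberate.</p>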
<hr>
<h2>A reference table</h2>
<table>
<thead>
<tr>
<th>Stack</th>
<th>Cost profile</th>
<th>Cold-start</th>
<th>Best for</th>
</tr>
</thead>
<tbody><tr>
<td><strong>BigQuery</strong> scheduled queries</td>
<td>Per-byte scanned</td>
<td>None</td>
<td>&lt; 10 min jobs, ad-hoc reports</td>
</tr>
<tr>
<td><strong>Cloud Composer</strong> (Airflow)</td>
<td>Always-on VMs</td>
<td>Slow (~minutes)</td>
<td>Complex DAGs, long-running</td>
</tr>
<tr>
<td><strong>Cloud Run</strong> + Pub/Sub</td>
<td>Per request</td>
<td>Fast (&lt;1s)</td>
<td>Event-driven, bursty</td>
</tr>
<tr>
<td><strong>Dataflow</strong> streaming</td>
<td>Per-vCPU-hour</td>
<td>Slow</td>
<td>Continuous, &gt; 100 events/sec</td>
</tr>
</tbody></table>
<p>Pick the cheapest one that fits the SLA. The &quot;best practice&quot; answer is often the <em>most expensive</em> — be skeptical.</p>
<hr>
<h2>Tiny configs that bite</h2>
<p>A <code>~/.config/gcloud/configurations/config_default</code> that quietly points at the wrong project will cost you an afternoon:</p>
<pre><code class="language-ini">[core]
account = you@example.com
project = thaneesh-prod        # &lt;-- this one
disable_usage_reporting = True

[compute]
region = asia-south1
zone   = asia-south1-a
</code></pre>
<p>Run <code>gcloud config list</code> <del>every morning</del> <em>before doing anything destructive</em>. To override the configured project for a single command, pass <code>--project=...</code>.</p>
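<p>The same check can be scripted as a guard in front of a destructive step. This is a sketch under assumptions: it parses INI text passed in as a string rather than reading the file, and <code>active_project</code> and <code>assert_project</code> are hypothetical helper names.</p>
<pre><code class="language-python">import configparser


def active_project(config_text):
    """Read core/project from gcloud-style INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.get("core", "project")


def assert_project(config_text, expected):
    """Raise before a destructive step if the configured project is unexpected."""
    actual = active_project(config_text)
    if actual != expected:
        raise RuntimeError(f"gcloud points at {actual!r}, expected {expected!r}")


sample = """
[core]
account = you@example.com
project = thaneesh-prod
"""
print(active_project(sample))  # thaneesh-prod
assert_project(sample, "thaneesh-prod")  # passes silently
</code></pre>
<p>Calling <code>assert_project(sample, "thaneesh-dev")</code> raises, which is the behavior you want before a delete or a truncate.</p>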
<hr>
<h2>What I wish someone had told me</h2>
<ol>
<li>Pipelines that &quot;just work for years&quot; are the most dangerous — they&#39;ve never been <strong>tested under failure</strong>.</li>
<li>Cost dashboards lie. The bill arrives 30 days late, so set alerts at 1.5× and 2× of last month&#39;s spend rather than at absolute amounts.</li>
<li>The first question for any data quality issue should always be <code>&quot;is this fresh?&quot;</code>, not <code>&quot;is this correct?&quot;</code>.</li>
</ol>
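<p>The relative thresholds from the cost item can be written down directly. In this sketch the level names <code>warn</code> and <code>page</code> and the function <code>cost_alert</code> are assumptions; only the 1.5× and 2× ratios come from the list above.</p>
<pre><code class="language-python">def cost_alert(current, last_month):
    """Classify spend relative to last month's bill."""
    if last_month == 0:
        # No baseline: any spend at all is worth a look.
        return "page" if current > 0 else "ok"
    ratio = current / last_month
    if ratio >= 2.0:
        return "page"
    if ratio >= 1.5:
        return "warn"
    return "ok"


print(cost_alert(120.0, 100.0))  # ok
print(cost_alert(150.0, 100.0))  # warn
print(cost_alert(210.0, 100.0))  # page
</code></pre>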
<h3>Inline gotchas to remember</h3>
<p>The <code>_PARTITIONTIME</code> column is a <code>TIMESTAMP</code>, not a <code>DATE</code>, and GoogleSQL does not implicitly coerce a <code>DATE</code> to a <code>TIMESTAMP</code>, so don&#39;t compare it to <code>CURRENT_DATE()</code> directly. Use <code>DATE(_PARTITIONTIME) = CURRENT_DATE()</code> or compare to a <code>TIMESTAMP</code>. The quieter trap is <code>NULL</code>: a comparison against <code>NULL</code> returns <code>NULL</code>, not <code>FALSE</code>, and <code>WHERE</code> drops <code>NULL</code> rows → your filter silently passes nothing.</p>
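<p>The <code>NULL</code> half of this gotcha is ordinary SQL three-valued logic and can be reproduced locally. The sketch below uses SQLite from the Python standard library, not BigQuery, with a made-up <code>events</code> table; it only demonstrates how <code>WHERE</code> treats a comparison that evaluates to <code>NULL</code>.</p>
<pre><code class="language-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, partition_day TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2026-05-15"), (2, "2026-05-15"), (3, None)],
)


def count_where(predicate, params=()):
    """Count rows in events that satisfy a WHERE predicate."""
    query = f"SELECT COUNT(*) FROM events WHERE {predicate}"
    return conn.execute(query, params).fetchone()[0]


# Comparing to NULL yields NULL for every row, so nothing passes the filter.
print(count_where("partition_day = ?", (None,)))          # 0
print(count_where("partition_day = ?", ("2026-05-15",)))  # 2
print(count_where("partition_day IS NULL"))               # 1
</code></pre>
<p>No error, no warning, zero rows: the same shape of failure as a filter whose right-hand side quietly turned into <code>NULL</code>.</p>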
<hr>
<h2>Further reading</h2>
<ul>
<li>The <a href="https://cloud.google.com/architecture">GCP architecture center</a> — official, sometimes dated, always worth a skim</li>
<li><em>Designing Data-Intensive Applications</em> by Martin Kleppmann — the textbook</li>
<li>The Wikipedia article on <a href="https://en.wikipedia.org/wiki/Idempotence">idempotence</a>, the idea underneath every safely re-runnable pipeline</li>
</ul>
<hr>
<p>If a pipeline can&#39;t be paused, rewound, and re-run from any point safely, it&#39;s not finished — it&#39;s just running. The rest of the job is making it boring.</p>
]]></content:encoded>
  </item>
  <item>
    <title>First Post</title>
    <link>https://blog.thaneesh.in/posts/first-post</link>
    <guid isPermaLink="true">https://blog.thaneesh.in/posts/first-post</guid>
    <pubDate>Fri, 15 May 2026 15:05:24 GMT</pubDate>
    <description>first post</description>
    <content:encoded><![CDATA[<h1>first post</h1>
]]></content:encoded>
  </item>
</channel>
</rss>
