Catapulting into Causal Calculus [DRAFT]

How do you “do-”?
causal
graphs
statistics
draft
Author

Scott H. Hawley

Published

May 18, 2026

Work in Progress! Come back later for more

“…causal calculus differentiates between two types of conditional distributions one might want to estimate. …in ML we usually estimate only one of them, but in some applications we should actually…estimate the other one.” – Ferenc Huszar [1]

Intro

I’ve been wanting to learn Causal Inference (CI) / Causal Modeling (causal inference) “properly” ever since I asked for Judea Pearl’s bestselling The Book of Why [2] for Christmas in 2019. It’s a great read, but it hides any actual computation by merely mentioning the mystical “do calculus”. Pearl offers details on that in his textbook [3], but I want get up to speed faster, so I’ve been reading and watching other treatments [4] [5] [6] [7] [8] [9] [10], which most often are either “slow” or quickly throw up what I call the “Wall of Math.”1

In this post, we’re going to catapult straight over that wall — starting immediately with code and calculation, “do-”ing the “do-”calculus, introducing definitions and proofs only when we actually need them. We’re going to focus learning how to “do the do-”.

“Do the Do-” parody image (Source: ChatGPT5 – This and another humorous image are AI-generated, but all the prose is human-authored. Some of the code (e.g. for visualizations) was AI-generated.)

Case Study: Gear Acquisition Syndrome (GAS)

Causal modeling introductions invariably choose some medical intervention as an illustrative example. Since my specialty is teaching audio engineers [11], we’ll discuss the sad affliction known as Gear Acquisition Syndrome (GAS).

Slide from “Bayesian Analysis for Audio Engineers” [11]

GAS is a debilitating condition affecting millions of musicians worldwide. It is the (false) cognitive bias in which the musician thinks that buying gear will improve their playing, musicianship, performance, and/or overall “sound”.2 For this tutorial, we’ll use “sound” as a catch-all term, though we don’t mean to limit it to only tone quality.

NoteClarification

GAS is an in-joke among musicians. It is a fake condition that highlights our real tendency to delude ourselves into justifying “unnecessary” purchases. (And yes, I can tell you exactly what gear I “need” next.)

GAS Infographic - Thanks NanoBanana

As we build our example, we’ll see that there’s a strong correlation between having fancy gear and sounding great — but is that correlation actually causal?

Sample Data

I made a fictitious 16-person dataset that mixes famous artists with anonymous randos. The naive/GAS view looks only at the correlation between Gear and Sound:

Get dataset and display naive view
import pandas as pd

df = pd.read_csv('player_data.csv')

COLORS = {
    'Talent': {'High': '#4477AA', 'Low':  '#CC6600'},   # blue / orange
    'Gear':   {'Fancy': '#AA3377', 'Cheap': '#CC9900'}, # purple / yellow  
    'Sound':  {'Great': '#228833', 'Bad':  '#BB4455'},  # green / red-pink
}

def style_col(val, col):
    c = COLORS[col].get(val, '')
    return f'background-color: {c}; color: #EEEEEE' if c else ''

(df.drop(columns=['Talent']).style
   .applymap(style_col, subset=['Gear'],  col='Gear')
   .applymap(style_col, subset=['Sound'], col='Sound'))
  Name Gear Sound
0 Tom Scholz Fancy Great
1 Jeff Beck Fancy Bad
2 Jimi Fancy Great
3 Prince Fancy Great
4 Jenny Fancy Great
5 Jay Fancy Great
6 Bill Fancy Bad
7 Paris Hilton Fancy Bad
8 Milli Vanilli Cheap Great
9 Bob Dylan Cheap Bad
10 Dale Cheap Great
11 Todd Cheap Bad
12 Bubba Cheap Bad
13 Abe Cheap Bad
14 Earnest Cheap Bad
15 Mongo Cheap Bad


We’ll regard Sound as the “outcome”, the thing we measure via listening tests where listeners rate their preferences. We could imagine listeners submitting ranking scores and averaging them into a Mean Opinion Score (MOS), but for now it suffices to imagine (binary) A/B Testing for user preferences. Following causal inference conventions, we’ll denote the outcome as \(Y\) and use binary values:

  • \(Y=1\): Sounds Great
  • \(Y=0\): Sounds Bad (Technical term: “like ass”)

According this view, \(P(\rm{Sound=Great | Gear=Fancy})\) = 5/8 = 0.625. => Looks like better gear probably makes you sound better!

Does Talent Matter?

There’s a bit more to the dataset if we probe deeper. There’s also info on the players level of “Talent:

Code
(df.style
   .applymap(style_col, subset=['Talent'], col='Talent')
   .applymap(style_col, subset=['Gear'],   col='Gear')
   .applymap(style_col, subset=['Sound'],  col='Sound'))
  Name Talent Gear Sound
0 Tom Scholz High Fancy Great
1 Jeff Beck High Fancy Bad
2 Jimi High Fancy Great
3 Prince High Fancy Great
4 Jenny High Fancy Great
5 Jay High Fancy Great
6 Bill High Fancy Bad
7 Paris Hilton Low Fancy Bad
8 Milli Vanilli Low Cheap Great
9 Bob Dylan High Cheap Bad
10 Dale Low Cheap Great
11 Todd Low Cheap Bad
12 Bubba Low Cheap Bad
13 Abe Low Cheap Bad
14 Earnest Low Cheap Bad
15 Mongo Low Cheap Bad

Hmmmmm…. ?? Interesting. So somehow Talent is bound up with Sound too, and it’s also strongly correlated with Gear. In this case, Talent is known as a “confounder”, and we’ll explore its influence as we go on in the tutorial.

But first, let’s build a little machinery so we can work with graphical represenations of causal models.

Defining and Visualizing Graphs

First we need to be able to define and take a look at the graphs that will comprise our causal models. We’ll set up the graph with Python (nodes, edges, directions) then push it to the viz routine so we can see it and move it around. I can’t articulate my intended applications, but based on my interests, I’m use a few libraries for this:

  1. NetworkX for general graph bookkeeping (PyG is overkill for today.)
  2. D3.js for interactive visualizations with force-directed nodes.
  3. Pandas for handling tables of data
  4. PyTorch for other sundry numerical calculations, since ultimately I want to do “machine learning stuff”

We won’t use causal inference-specific libraries like CausalML or EconML, …, etc. because I suspect they would likely hide and/or automate the bits I hope to learn to implement.

Installing Packages
!uv pip install networkx matplotlib torch torchvision --quiet

Let’s define a small graph or two with NetworkX:

import networkx as nx
import copy

# Gear Acquisition Syndrome
G = nx.DiGraph()
G.add_node("Gear",   role="treatment")
G.add_node("Sound",  role="outcome")
G.add_edges_from([("Gear", "Sound"),])

# "Reality"
G2 = copy.deepcopy(G)
G2.add_node("Talent", role="confounder")
G2.add_edges_from([
    ("Talent",   "Sound"),
    ("Talent",   "Gear",),
])

And then visualize with D3.js – try dragging the nodes around!

Visualization utility code — feel free to ignore
# Did I let Claude generate this code cell? ABSOLUTELY
import json, itertools
from IPython.display import HTML

ROLE_COLORS = {
    "treatment":  "#c0392b",  # dark red
    "outcome":    "#1e8449",  # dark green
    "confounder": "#6c3483",  # dark purple
    "default":    "#2471a3",  # dark blue
}
_dag_counter = itertools.count()

def show_dag(G, width=600, height=300, node_size=34, font_size=13, title=""):
    if not isinstance(G, list):
        G, title = [G], [title]
    elif not isinstance(title, list):
        title = [title] * len(G)

    specs = []
    divs  = []
    for g, t in zip(G, title):
        nodes = [{"id": str(n), "color": ROLE_COLORS.get(d.get("role"), ROLE_COLORS["default"])}
                 for n, d in g.nodes(data=True)]
        links = [{"source": str(u), "target": str(v), "label": d.get("label", "")}
                 for u, v, d in g.edges(data=True)]
        uid = f"dag{next(_dag_counter)}"
        specs.append({"uid": uid, "nodes": nodes, "links": links,
                      "title": t, "titleOffset": 28 if t else 0})
        divs.append(f'<div id="{uid}" style="width:{width}px;height:{height}px;'
                    f'border:1px solid #444;border-radius:6px;background:#1a1a2e;display:inline-block;"></div>')

    specs_json = json.dumps(specs)
    container  = '<div style="display:flex;gap:12px;flex-wrap:wrap;">' + "".join(divs) + '</div>'

    script = f"""
<script type="module">
import * as d3 from "https://cdn.jsdelivr.net/npm/d3@7/+esm";

function initGraphs() {{
  const specs = {specs_json};
  const W = {width}, H = {height}, R = {node_size}, FS = {font_size};
  console.log("initGraphs: found", specs.length, "specs");

  for (const spec of specs) {{
    const {{uid, nodes, links, title, titleOffset}} = spec;
    const el = document.getElementById(uid);
    console.log("Looking for #" + uid + ":", el);
    if (!el) continue;

    const w = W, h = H, r = R, fs = FS;
    const svg = d3.select(el).append("svg").attr("width", w).attr("height", h);

    if (title) {{
      svg.append("text").attr("x", w/2).attr("y", 30)  // title placement
        .attr("text-anchor", "middle").attr("fill", "#ccc")
        .attr("font-size", "18px").attr("font-family", "sans-serif").attr("font-weight", "bold")
        .text(title);
    }}

    svg.append("defs").append("marker")
      .attr("id", uid + "-arrow").attr("viewBox", "0 -5 10 10")
      .attr("refX", 10).attr("refY", 0).attr("markerWidth", 7).attr("markerHeight", 7).attr("orient", "auto")
      .append("path").attr("d", "M0,-5L10,0L0,5").attr("fill", "#aaa");

    const linkData = links.map(l => ({{...l}}));
    const nodeData = nodes.map(n => ({{...n}}));

    const sim = d3.forceSimulation(nodeData)
      .force("link", d3.forceLink(linkData).id(d => d.id).distance(140))
      .force("charge", d3.forceManyBody().strength(-500))
      .force("center", d3.forceCenter(w/2, titleOffset/2 + h/2));

    const link = svg.append("g").selectAll("line").data(linkData).join("line")
      .attr("stroke", "#aaa").attr("stroke-width", 2)
      .attr("marker-end", "url(#" + uid + "-arrow)");

    const edgeLabels = svg.append("g").selectAll("text").data(linkData.filter(d => d.label)).join("text")
      .attr("text-anchor", "middle").attr("fill", "#ffdd57")
      .attr("font-size", "16px").attr("font-family", "sans-serif").attr("font-weight", "bold")
      .text(d => d.label);

    const node = svg.append("g").selectAll("g").data(nodeData).join("g")
      .call(d3.drag()
        .on("start", (e,d) => {{ if (!e.active) sim.alphaTarget(0.3).restart(); d.fx=d.x; d.fy=d.y; }})
        .on("drag",  (e,d) => {{ d.fx=e.x; d.fy=e.y; }})
        .on("end",   (e,d) => {{ if (!e.active) sim.alphaTarget(0); d.fx=null; d.fy=null; }}));

    node.append("circle").attr("r", r).attr("fill", d => d.color).attr("stroke", "#fff").attr("stroke-width", 1.5);

    node.each(function(d) {{
      const lines = d.id.split("\\n");
      const txt = d3.select(this).append("text").attr("text-anchor", "middle")
        .attr("fill", "white").attr("font-size", fs+"px").attr("font-family", "sans-serif");
      const lh = fs * 1.2, yStart = -((lines.length - 1) * lh) / 2;
      lines.forEach((line, i) => {{
        txt.append("tspan").attr("x", 0).attr("y", yStart + i*lh).attr("dy", "0.35em").text(line);
      }});
    }});

    sim.on("tick", () => {{
      link.each(function(d) {{
        const dx = d.target.x-d.source.x, dy = d.target.y-d.source.y;
        const dist = Math.sqrt(dx*dx+dy*dy) || 1, ux = dx/dist, uy = dy/dist;
        d3.select(this)
          .attr("x1", d.source.x+ux*r).attr("y1", d.source.y+uy*r)
          .attr("x2", d.target.x-ux*r).attr("y2", d.target.y-uy*r);
      }});
      edgeLabels
        .attr("x", d => (d.source.x+d.target.x)/2)
        .attr("y", d => (d.source.y+d.target.y)/2 - 8);
      node.attr("transform", d => `translate(${{d.x}},${{d.y}})`);
    }});
  }}
}}

setTimeout(initGraphs, 100);
</script>
"""
    return HTML(container + script)

show_dag([G, G2], width=330, height=320, title=["Gear Acquisition Syndrome", "Reality"])

The Key Question as Math

In terms of causal inference., we want to compute the probability \(P({\rm Sound=Great} | \rm{do(Gear=Fancy}))\), i.e., will the sound will improve if we “intervene” by upgrading the gear?

To do that, we need to “marginalize” over Talent – i.e. do a weighted sum over all subsets. And to keep our notation compact, let’s define \(X\) = Gear (1=Fancy), \(Y\) = Sound (1=Great), \(Z\) = Talent (0=Low, 1=High). Then we want: \[\begin{eqnarray}P(Y=1 \mid do(X=1)) &=& P(Y=1 \mid X=1, Z=1)\cdot P(Z=1)\\ &+& P(Y=1 \mid X=1, Z=0)\cdot P(Z=0)\end{eqnarray}\] which in general form is called the “backdoor adjustment formula” to control for confounder \(Z\) (which in our case Talent):

\[P(Y \mid do(X)) = \sum_z P(Y \mid X, Z=z)\cdot P(Z=z)\]

Tip

The basic idea of the do-calculus is to suggest a way to compute the results of the (counterfactual) intervention using observed quantities. In so doing, we recover the values we “would have gotten” had we conducted a Randomized Controlled Trial (RCT).

In code, that calculation can look like:

p_talent_low  = (df["Talent"] == "Low").mean()
p_talent_high = (df["Talent"] == "High").mean()
print(f"P(Z=0)     = {p_talent_low:.3f},  P(Z=1)     = {p_talent_high:.3f}")

subset_high = df.query("Gear == 'Fancy' and Talent == 'High'")
subset_low  = df.query("Gear == 'Fancy' and Talent == 'Low'")

p_sound_given_low  = (subset_low["Sound"]  == "Great").mean()
p_sound_given_high = (subset_high["Sound"] == "Great").mean()
print(f"P(Y | Z=0) = {p_sound_given_low:.3f},  P(Y | Z=1) = {p_sound_given_high:.3f}")

p_naive = (df[df["Gear"] == "Fancy"]["Sound"] == "Great").mean()
print(f"Naive estimate (GAS)            = P(Y=1 | X=1)     = {p_naive:.3f}")

p_do_gear_fancy = (p_sound_given_high * p_talent_high) + (p_sound_given_low * p_talent_low)
print(f"Adjusted for Talent (Reality)   = P(Y=1 | do(X=1)) = {p_do_gear_fancy:.3f}")
P(Z=0)     = 0.500,  P(Z=1)     = 0.500
P(Y | Z=0) = 0.000,  P(Y | Z=1) = 0.714
Naive estimate (GAS)            = P(Y=1 | X=1)     = 0.625
Adjusted for Talent (Reality)   = P(Y=1 | do(X=1)) = 0.357

….So, at least judging by by this dataset, once we accounted for Talent, buying new gear yields less than a coin flip’s chance of improving someone’s sound. (Sweetwater, don’t sue! This is made up data. Feel free to provide real market data and I’ll work it in.)

That calculation was fine, but we can “do-” the computation (much) more efficiently via “advanced” pandas moves. For future reference, we’ll include such a code now, collapsed by default, and confirm it yields the same results as before.

Efficient backdoor adjustment code (Click to expand)
def backdoor_adjustment(data, outcome_col, treatment_col, confounder_col, outcome_val, treatment_val):
    """
    Computes P(Outcome = outcome_val | do(Treatment = treatment_val)) 
    by adjusting for a single confounding variable.
    """
    # 1. Calculate P(Confounder) for all categories
    p_confounder = data[confounder_col].value_counts(normalize=True)
    
    # 2. Filter data for the specific treatment strategy
    treatment_subset = data[data[treatment_col] == treatment_val]
    
    # 3. Calculate P(Outcome = outcome_val | Treatment, Confounder) for all categories
    p_outcome_given_confounder = (
        treatment_subset[outcome_col].eq(outcome_val)
        .groupby(treatment_subset[confounder_col])
        .mean() )
    
    # 4. Align indices, multiply, and sum (the adjustment formula)
    return (p_outcome_given_confounder * p_confounder).sum()
    
p_naive = (df[df["Gear"] == "Fancy"]["Sound"] == "Great").mean()
print(f"Naive estimate (GAS)            = P(Y=1 | X=1)     = {p_naive:.3f}")

result = backdoor_adjustment(df, "Sound", "Gear", "Talent", "Great", "Fancy")
print(f"Adjusted for Talent (Reality)   = P(Y=1 | do(X=1)) = {result:.3f}")
Naive estimate (GAS)            = P(Y=1 | X=1)     = 0.625
Adjusted for Talent (Reality)   = P(Y=1 | do(X=1)) = 0.357

More Adjustments

The full do-calculus has adjustment rules for a few different other kinds of subgraph configurations:

  • “front door” whereby The thing you’re interested in has all of its causal influence passing through an intermediary before reaching the effect.
  • “colliders”, Where two (or more) arrows point inward to a node.
  • ….something else…instrumental variables? #TODO

I’m not going to bother enumerating every single one because I want to move on to what I’m really interested in – time dependence.

What About Time Dependence?

I’m interested in time-dependent audio signals. Similar data streams could be weather patterns, logistics data, or financial data. Now, we could naively just let the time \(t\) be another variable like \(X\), \(Y\), and \(Z\), but this is typically not how it’s done – and smart people have already thought through this.[[12]][13]

Historically, Clive Granger proposed a causal method for economics time series in 1969 [14]. It amounts to creating a separate graph node for each instant of time (or at least for each time we have data), i.e., not just a function \(X(t)\) but rather the past-directed sequence of data points \(X_t\), \(X_{t-1}\), \(X_{t-2}\),\(\ldots\)

Granger Causality example. Signal Y is “Granger-caused” by signal X if changes in Y consistently follow changes in X. So that knowing X improves your forecasts of Y. (Source: Wikipedia)
Warning

Ordinary Granger “causality” isn’t really “true” causality, rather it is just a measure of time-delayed predictive correlation. In order to explore do-calculus modifications to it, we’ll have to do some work.

Typically, we may have a lot of different data streams, rather than enumerating X, Y, etc. we’ll just use X with superscripts and subscripts denoting the \(j\)th data stream at time \(t\) as \(X^j_t\). The relationships between these streams as we unroll each node at each time can either be instantaneous or have time delays as shown in this image from [13]:

DAGs showing multichannel time series without instantaneous effects (top) and with them (bottom). Source: [13]

Advanced Topics

Code
G3 = nx.DiGraph()
G3.add_node("Talent",      role="default")
G3.add_node("Performance", role="confounder")
G3.add_node("Gear",        role="default")
G3.add_node("Recording",   role="confounder")
G3.add_node("Signal\nProc", role="treatment")
G3.add_node("Sound",       role="outcome")

G3.add_edges_from([
    ("Talent",      "Performance"),
    ("Performance", "Recording"),
    ("Gear",        "Recording"),
    ("Recording",   "Signal\nProc"),
    ("Signal\nProc", "Sound"),
])

show_dag(G3, width=800, height=600, title="Why does my recording sound like ass?")

Work in Progress

Still writing! Come back for more later.

Glossary

This is to keep my head straight while the jargon piles up.

Term Definition
collider something
confounder something
minimality something

References

[1]
F. Huszár, “ML beyond curve fitting: An intro to causal inference and do-calculus.” https://www.inference.vc/untitled/, May 2018.
[2]
J. Pearl and D. Mackenzie, The Book of Why: The new science of cause and effect. New York: Basic Books, 2018.
[3]
J. Pearl, Causality: Models, reasoning, and inference, 2nd ed. Cambridge, UK: Cambridge University Press, 2009.
[4]
R. R. Tucci, “Introduction to Judea Pearl’s do-calculus,” CoRR, vol. abs/1305.5506, 2013, Available: http://arxiv.org/abs/1305.5506
[5]
B. Neal, Introduction to Causal Inference (from a machine learning perspective). 2020. Available: https://www.bradyneal.com/causal-inference-course
[6]
M. Facure, Causal inference in Python. O’Reilly Media, 2023. Available: https://www.oreilly.com/library/view/causal-inference-in/9781098140243/
[7]
A. Ghosh, A. Roy, and D. Herremans, “KARMA-MV: A benchmark for causal question answering on music videos.” 2026. Available: https://arxiv.org/abs/2605.08175
[8]
A. (Ari) Bornstein, “3 PyTorch lightning community causal inference examples to inspire your next project.” https://devblog.pytorchlightning.ai/3-pytorch-lightning-community-causal-inference-examples-to-inspire-your-next-project-da6f45c07b70, Mar. 16, 2021.
[9]
J. Krohn, “SDS 909: Causal AI, with dr. Robert osazuwa ness.” https://www.superdatascience.com/podcast/sds-909-causal-ai-with-dr-robert-usazuwa-ness, Jul. 29, 2025.
[10]
B. J. Koch, T. Sainburg, P. G. Bastías, S. Jiang, Y. Sun, and J. G. Foster, “A primer on deep learning for causal inference,” Sociological Methods & Research, vol. 54, no. 2, pp. 397–447, May 2025, doi: 10.1177/00491241241234866.
[11]
Bayesian analysis for audio engineers.” Meeting of the Belmont Student Section of the Audio Engineering Society, 2014. Available: https://hedges.belmont.edu/BayesianAnalysisPartI.pdf
[12]
D. Danks and S. Plis, Learning Causal Structure from Undersampled Time Series,” in Advances in Neural Information Processing Systems, 2013. doi: 10.1184/R1/6492101.v1.
[13]
J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference: Foundations and learning algorithms. in Adaptive computation and machine learning. The MIT Press, 2017, pp. 197–211.
[14]
C. W. J. Granger, “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica, vol. 37, no. 3, pp. 424–438, 1969, doi: 10.2307/1912791.

  1. 2026 Scott H. Hawley

Footnotes

  1. Specifically: a Slough of Despond that front-loads definitions and jargon before you can do anything useful. In physics education I call that the “Chapter 2 Math Dump”; in ML it’s the “Wall of Math”. For causal inference it’s not even equations, it’s seemingly-endless terminology that induces cognitive load while you’re anxiously wondering “But how can I calculate something?” These treatments also invariably claim to assume mere “basic familiarity with statistics” yet invoke laws I’ve never heard of. Seems “basic” is relative.↩︎

  2. Other definitions of GAS – e.g., “buying more gear will help me make better music” – are isomorphic to the one we have adopted. Predatory companies like Sweetwater and Guitar Center capitalize on the pathetic delusions of desperate losers and our lust for shiny toys. /s↩︎