A Philosophy of Software Design•Composition•Lesson 10 of 13

Data Transforms

Pure functions do one thing. Real programs do many things. The question is how you connect them.

The answer is boring, and that is the point: you call them in sequence. No special abstraction needed. Write small functions that take data in and return data out. Call them one after another. Use intermediate variables to name each stage.

Transform Stages

A transform stage is a pure function that takes data and returns transformed data. Each stage does one job.

function normalize(text: string): string {
  return text.trim().replace(/ {2,}/g, " ");
}

function lowercase(text: string): string {
  return text.toLowerCase();
}

function slugify(text: string): string {
  return text.replace(/\s+/g, "-");
}

Three functions. Each one is trivially testable. Each one is trivially readable. Now chain them:

const raw = "  Hello World  ";
const normalized = normalize(raw);
const lowered = lowercase(normalized);
const slug = slugify(lowered);
// "hello-world"

Each intermediate variable documents the state of the data at that point. You can inspect any of them in a debugger. You can log any of them. You can test any stage in isolation.
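
Testing a stage in isolation can be sketched as follows. The three stage functions are repeated from above so the snippet stands alone. The `expectEqual` helper is a hypothetical stand-in for whatever assertion library you use.

```typescript
function normalize(text: string): string {
  return text.trim().replace(/ {2,}/g, " ");
}

function lowercase(text: string): string {
  return text.toLowerCase();
}

function slugify(text: string): string {
  return text.replace(/\s+/g, "-");
}

// Hypothetical helper: throws when a stage returns something unexpected.
function expectEqual(actual: string, expected: string): void {
  if (actual !== expected) {
    throw new Error(`expected "${expected}" but got "${actual}"`);
  }
}

// Each stage is checked on its own: no setup, no mocks, no shared state.
expectEqual(normalize("  Hello   World  "), "Hello World");
expectEqual(lowercase("Hello World"), "hello world");
expectEqual(slugify("hello world"), "hello-world");
```

Because every stage is pure, a failing check points at exactly one function.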

Compare this to the nested style:

const slug = slugify(lowercase(normalize("  Hello World  ")));

Same result. Reads inside-out. For three functions, tolerable. For eight, unreadable. Intermediate variables win on clarity.

Structured Data Transforms

Real transforms process structured data, not just strings. Each stage takes a shape in and returns a (possibly different) shape out.

interface RawRecord {
  readonly name: string;
  readonly email: string;
  readonly age: string;
  readonly role: string;
}

interface ValidatedUser {
  readonly name: string;
  readonly email: string;
  readonly age: number;
  readonly role: "admin" | "editor" | "viewer";
}

Build the stages:

// Stage 1: Normalize raw input
function normalizeRecords(records: readonly RawRecord[]): readonly RawRecord[] {
  return records.map(r => ({
    name: r.name.trim(),
    email: r.email.trim().toLowerCase(),
    age: r.age.trim(),
    role: r.role.trim().toLowerCase(),
  }));
}

// Stage 2: Remove incomplete entries
function removeIncomplete(records: readonly RawRecord[]): readonly RawRecord[] {
  return records.filter(r => r.name && r.email && r.age && r.role);
}

// Stage 3: Parse into domain types
function parseUsers(records: readonly RawRecord[]): readonly ValidatedUser[] {
  const validRoles = ["admin", "editor", "viewer"] as const;
  return records
    .map(r => ({
      name: r.name,
      email: r.email,
      age: parseInt(r.age, 10),
      // Unchecked cast: the filter below drops any record whose role is not valid.
      role: r.role as ValidatedUser["role"],
    }))
    .filter(u => !isNaN(u.age) && validRoles.includes(u.role));
}

// Stage 4: Deduplicate
function deduplicateByEmail(users: readonly ValidatedUser[]): readonly ValidatedUser[] {
  const seen = new Set<string>();
  return users.filter(u => {
    if (seen.has(u.email)) return false;
    seen.add(u.email);
    return true;
  });
}

Four stages. Each independently testable. Now connect them:

function processUsers(records: readonly RawRecord[]): readonly ValidatedUser[] {
  const normalized = normalizeRecords(records);
  const complete = removeIncomplete(normalized);
  const parsed = parseUsers(complete);
  return deduplicateByEmail(parsed);
}

That is the entire orchestration. Four lines. Each line names what the data is at that point. The function reads top-to-bottom as a story: normalize, remove incomplete, parse, deduplicate.

No framework. No abstraction. Just functions.
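
To see what each named stage holds, here is a run of the four stages over a handful of records. The stage functions are repeated from above so the snippet stands alone. The sample records are made up for illustration.

```typescript
interface RawRecord {
  readonly name: string;
  readonly email: string;
  readonly age: string;
  readonly role: string;
}

interface ValidatedUser {
  readonly name: string;
  readonly email: string;
  readonly age: number;
  readonly role: "admin" | "editor" | "viewer";
}

function normalizeRecords(records: readonly RawRecord[]): readonly RawRecord[] {
  return records.map(r => ({
    name: r.name.trim(),
    email: r.email.trim().toLowerCase(),
    age: r.age.trim(),
    role: r.role.trim().toLowerCase(),
  }));
}

function removeIncomplete(records: readonly RawRecord[]): readonly RawRecord[] {
  return records.filter(r => r.name && r.email && r.age && r.role);
}

function parseUsers(records: readonly RawRecord[]): readonly ValidatedUser[] {
  const validRoles = ["admin", "editor", "viewer"] as const;
  return records
    .map(r => ({
      name: r.name,
      email: r.email,
      age: parseInt(r.age, 10),
      role: r.role as ValidatedUser["role"],
    }))
    .filter(u => !isNaN(u.age) && validRoles.includes(u.role));
}

function deduplicateByEmail(users: readonly ValidatedUser[]): readonly ValidatedUser[] {
  const seen = new Set<string>();
  return users.filter(u => {
    if (seen.has(u.email)) return false;
    seen.add(u.email);
    return true;
  });
}

// Made-up input: one messy record, one duplicate email, one missing age,
// one unparseable age, and one unknown role.
const sampleRecords: readonly RawRecord[] = [
  { name: "  Ada ", email: "ADA@Example.com ", age: "36", role: "Admin" },
  { name: "Ada L.", email: "ada@example.com", age: "36", role: "editor" },
  { name: "Grace", email: "grace@example.com", age: "", role: "viewer" },
  { name: "Linus", email: "linus@example.com", age: "abc", role: "viewer" },
  { name: "Edsger", email: "edsger@example.com", age: "72", role: "owner" },
];

const normalized = normalizeRecords(sampleRecords);
const complete = removeIncomplete(normalized);
const parsed = parseUsers(complete);
const unique = deduplicateByEmail(parsed);

console.log(normalized.length, complete.length, parsed.length, unique.length);
// prints: 5 4 2 1
```

Logging the length of each intermediate variable shows where records drop out, which is hard to see when the calls are nested.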

Each Stage is a Deep Module

Look at deduplicateByEmail. Its interface is narrow: ValidatedUser[] in, ValidatedUser[] out. Its implementation could use a Set (as shown), a Map, a sorted array with binary search, or, where occasional false positives are acceptable, a Bloom filter. The caller does not know and does not care.

This is the deep module principle applied to functions. A well-named transform stage is a deep function: narrow interface, hidden complexity.

// Narrow interface, deep implementation
function enrichWithGeoData(users: readonly ValidatedUser[]): readonly EnrichedUser[] {
  // Inside: IP geolocation lookup, timezone inference, locale detection,
  // coordinate normalization, country code standardization.
  // The caller sees: ValidatedUser[] -> EnrichedUser[]
}
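
As a small illustration of hidden implementation, here is a sketch of two interchangeable bodies behind the same signature. The names `deduplicateWithSet` and `deduplicateWithMap` are invented here so both versions can sit side by side. In real code either one would simply be called deduplicateByEmail.

```typescript
interface ValidatedUser {
  readonly name: string;
  readonly email: string;
  readonly age: number;
  readonly role: "admin" | "editor" | "viewer";
}

// Set-based version, as in the lesson: keeps the first user seen per email.
function deduplicateWithSet(users: readonly ValidatedUser[]): readonly ValidatedUser[] {
  const seen = new Set<string>();
  return users.filter(u => {
    if (seen.has(u.email)) return false;
    seen.add(u.email);
    return true;
  });
}

// Map-based alternative: same signature, same first-wins behaviour.
// A Map iterates in insertion order, so the output order matches the Set version.
function deduplicateWithMap(users: readonly ValidatedUser[]): readonly ValidatedUser[] {
  const firstByEmail = new Map<string, ValidatedUser>();
  for (const u of users) {
    if (!firstByEmail.has(u.email)) firstByEmail.set(u.email, u);
  }
  return Array.from(firstByEmail.values());
}

// Made-up sample data.
const sampleUsers: readonly ValidatedUser[] = [
  { name: "Ada", email: "ada@example.com", age: 36, role: "admin" },
  { name: "Grace", email: "grace@example.com", age: 45, role: "viewer" },
  { name: "Ada L.", email: "ada@example.com", age: 36, role: "editor" },
];

const viaSet = deduplicateWithSet(sampleUsers);
const viaMap = deduplicateWithMap(sampleUsers);
console.log(JSON.stringify(viaSet) === JSON.stringify(viaMap)); // prints: true
```

A caller of either function compiles and behaves the same, which is what a narrow interface buys you.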

When Stages Need Different Shapes

Sometimes one stage produces data that the next stage needs in a different shape. Two approaches:

Approach 1: Return enriched objects.

Each stage passes through the original data plus its additions:

function parseHeaders(text: string): { text: string; headers: Header[] } {
  const headers = /* extract headers from text */;
  return { text, headers }; // pass text through for the next stage
}

function extractLinks(text: string): { text: string; links: Link[] } {
  const links = /* extract links from text */;
  return { text, links };
}

Approach 2: Orchestrate at the call site.

Let the caller decide what data each stage gets:

const normalized = normalizeWhitespace(rawText);
const { headers } = parseHeaders(normalized);
const { links } = extractLinks(normalized);
const toc = generateToc(headers);

Approach 2 is often clearer. The orchestration function shows the data flow explicitly. Each stage takes exactly the input it needs.
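
Here is one way the call-site orchestration could look end to end. The lesson does not define Header, Link, or the stage bodies, so everything below is an assumption made for illustration: headers are taken to be Markdown-style lines starting with `#`, and links to be `[label](url)` pairs.

```typescript
// Hypothetical shapes; not defined in the lesson.
interface Header { readonly level: number; readonly title: string; }
interface Link { readonly label: string; readonly url: string; }

// Assumed behaviour: strip trailing whitespace from every line.
function normalizeWhitespace(text: string): string {
  return text.split("\n").map(line => line.replace(/\s+$/, "")).join("\n");
}

// Assumed header syntax: one to six '#' characters, a space, then the title.
function parseHeaders(text: string): { text: string; headers: Header[] } {
  const headers: Header[] = [];
  for (const line of text.split("\n")) {
    const match = /^(#{1,6}) (.+)$/.exec(line);
    if (match) headers.push({ level: match[1].length, title: match[2] });
  }
  return { text, headers };
}

// Assumed link syntax: [label](url).
function extractLinks(text: string): { text: string; links: Link[] } {
  const links: Link[] = [];
  const pattern = /\[([^\]]+)\]\(([^)]+)\)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    links.push({ label: match[1], url: match[2] });
  }
  return { text, links };
}

// Indents each title by two spaces per header level below the first.
function generateToc(headers: readonly Header[]): string {
  return headers.map(h => `${"  ".repeat(h.level - 1)}- ${h.title}`).join("\n");
}

const rawText = "# Guide   \n## Setup\nSee [the docs](https://example.com/docs).  ";

// The call site decides which data each stage receives.
const normalized = normalizeWhitespace(rawText);
const { headers } = parseHeaders(normalized);
const { links } = extractLinks(normalized);
const toc = generateToc(headers);

console.log(toc);
console.log(links);
```

Both parseHeaders and extractLinks read the same normalized text, and only generateToc depends on an earlier result. That dependency is visible in four lines at the call site.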

When Intermediate Variables Help

Intermediate variables are not noise. They are documentation. Use them when:

The data changes shape between stages (string to array, raw to validated)
You need to branch on intermediate results (check for errors, filter subsets)
The variable name adds meaning (calling it validOrders tells the reader something step2Result does not)

Skip them when the chain is short and the meaning is obvious:

// Fine without intermediate variables
const slug = slugify(lowercase(normalize(title)));

// Better with them
const rawOrders = await fetchOrders(date);
const validOrders = validateOrders(rawOrders);
const pricedOrders = applyPricing(validOrders, pricingConfig);
const discounted = applyDiscounts(pricedOrders, activePromotions);
await saveOrders(discounted);

The goal is readability. If a variable name makes the next line clearer, use it. If it is just temp1, you are not adding information.
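
Branching on an intermediate result can be sketched like this. The Order shape, the validation rule (quantity must be positive), and the summary strings are all hypothetical. The lesson's validateOrders is not specified, so this is only one possible reading.

```typescript
// Hypothetical order shape.
interface Order { readonly id: string; readonly quantity: number; }

// Assumed rule: an order is valid when its quantity is greater than zero.
function validateOrders(orders: readonly Order[]): { valid: Order[]; invalid: Order[] } {
  const valid = orders.filter(o => o.quantity > 0);
  const invalid = orders.filter(o => !(o.quantity > 0));
  return { valid, invalid };
}

function summarizeOrders(orders: readonly Order[]): string {
  const { valid: validOrders, invalid: rejectedOrders } = validateOrders(orders);

  // The named intermediate result gives the branch something readable to test.
  if (rejectedOrders.length > 0) {
    return `rejected ${rejectedOrders.length} of ${orders.length} orders`;
  }

  const totalQuantity = validOrders.reduce((sum, o) => sum + o.quantity, 0);
  return `accepted ${validOrders.length} orders, ${totalQuantity} items`;
}

console.log(summarizeOrders([{ id: "a", quantity: 2 }, { id: "b", quantity: 3 }]));
// prints: accepted 2 orders, 5 items
console.log(summarizeOrders([{ id: "a", quantity: 0 }]));
// prints: rejected 1 of 1 orders
```

In a nested call chain there would be no rejectedOrders to branch on without restructuring the expression.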

Check Your Understanding

What is a transform stage?


Why use intermediate variables between transform stages?


How do transform stages relate to deep modules?


Practice

Build a text-processing system using pure transform stages.

Summary

Transform stages are pure functions: data in, data out. Each stage does one job.
Chain stages with plain function calls and intermediate variables. No special abstraction needed.
Name intermediate variables after the state of the data at that point.
Each stage is a deep module: narrow interface, hidden implementation.
Name stages after what they produce, not how they work.
Orchestrate at the call site. The function that connects stages shows the full data flow in one place.

Next, we challenge a sacred cow: when DRY goes too far, and why AHA (Avoid Hasty Abstractions) is often the better default.
