Building a Custom Markdown Pipeline with Rich Embeds

This post documents the custom markdown processing pipeline that powers this site. It handles LaTeX math, interactive galleries, link previews, and academic citations while maintaining plain .md files for portability and simple authoring. If you're building a content site and weighing MDX vs. custom markdown processing, this deep-dive covers the architecture, challenges, and trade-offs.

Why Plain Markdown Over MDX

MDX is powerful, but for content-heavy sites, it introduces unnecessary complexity. After evaluating both approaches for this site, we chose plain markdown with custom processing for several practical reasons.

MDX's downsides for content authoring:

Build fragility: A single syntax error breaks the entire build, blocking publication
Nondeterminism: Dynamic components can produce different output across environments
Collaboration barrier: Writers need React knowledge to add components
Noisy diffs: JSX mixed with prose makes content reviews difficult
Tooling overhead: Editors, linters, and preview tools all need MDX-specific support

Plain markdown's advantages:

Deterministic builds: Same input → same output, every time
Universal authoring: Anyone who knows markdown can contribute
Clean diffs: Code review focuses on content changes, not component logic
Portability: Content works with any markdown processor (no vendor lock-in)

The key insight: we only need a small set of rich embeds (galleries, link previews, math), which can be expressed declaratively without full React component power. Custom preprocessing handles these special cases while keeping 95% of the content as pure markdown.

Architecture

The pipeline splits work between build-time (static HTML generation) and runtime (client-side interactivity). Here's the complete flow:

BUILD TIME (Server):
Markdown file
  ↓ Front-matter extraction
  ↓ Citation pre-processing
  ↓ Placeholder protection
  ↓ Unified: remark-parse → remark-gfm → remark-heading-ids → 
     remark-rehype → rehype-highlight → rehype-stringify
  ↓ Placeholder restoration
  ↓ Post-processing
  ↓ Static HTML output
  
RUNTIME (Client):
  ↓ Parse and hydrate galleries (JSON parse, event handlers)
  ↓ Fetch and render link previews (API calls, skeleton → card)
  ↓ Typeset math (MathJax)
  ↓ Initialize TOC (Tocbot)

Build-time processing generates static HTML for fast initial loads. Client-side hydration adds interactivity without blocking render. This split provides:

Fast page loads (static HTML)
SEO-friendly content (crawlers see everything)
Progressive enhancement (rich features load incrementally)

Three key build-time stages:

Pre-processing: Protect special content with placeholders
Custom plugins: Heading IDs and citations
Post-processing: Restore content with proper escaping

The Placeholder System

The core challenge in processing markdown with rich content is protecting fragile syntax from being mangled by the parser. LaTeX expressions, HTML comments, and structured data blocks all contain characters that markdown processors interpret as special syntax.

The problem:

Example (raw markdown):

$$f(x) = \sum_{i=1}^{n} x_i < \alpha$$

Example (rendered without protection):

f(x) = \sum_{i=1}^{n} x_i < \alpha

Without protection, three things go wrong:

< is interpreted as the start of an HTML tag
Special characters are escaped incorrectly
The expression is wrapped in <p> tags

Solution: Extract → Store → Replace → Process → Restore

Note: Snippets use EXAMPLE_* placeholder names so this article doesn't match the real restoration regexes. The production pipeline uses tokens of the form MATH_PLACEHOLDER_*_END.

const protectedBlocks: string[] = []

// Protect display math
protectedContent = protectedContent.replace(/\$\$([\s\S]*?)\$\$/g, (match) => {
  protectedBlocks.push(match)
  return `EXAMPLE_MATH_PH_${protectedBlocks.length - 1}_END`
})

// Protect inline math
protectedContent = protectedContent.replace(/(?<!\$)\$(?!\$)(.*?)\$/g, (match) => {
  protectedBlocks.push(match)
  return `EXAMPLE_MATH_PH_${protectedBlocks.length - 1}_END`
})

// Protect galleries (with validation)
protectedContent = protectedContent.replace(/:::example-gallery[\s\S]*?:::/gi, (match) => {
  const validation = validateGalleryBlock(match.trim())
  protectedBlocks.push(validation.isValid ? match : generateErrorHTML(validation.error))
  return `EXAMPLE_GALLERY_PH_${protectedBlocks.length - 1}_END`
})

// Process through Unified...

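// Restore with proper escaping
protectedBlocks.forEach((block, index) => {
  const placeholder = `EXAMPLE_MATH_PH_${index}_END`
  if (html.includes(placeholder)) {
    const escaped = block.replace(/</g, '&lt;').replace(/>/g, '&gt;')
    const restored = /^\s*\$\$/.test(block) ? `<div class="math-display">${escaped}</div>` : escaped
    // Use a function replacer: in a replacement string, "$$" is an escape for a single "$",
    // which would silently turn restored display math into inline math
    html = html.replace(placeholder, () => restored)
  }
})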

Placeholder strings are unique, built only from letters, digits, and underscores (so they survive markdown processing intact; CommonMark does not treat intraword underscores as emphasis), and indexed (to support multiple instances).
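To make the round trip concrete, here is a minimal self-contained sketch of the same idea. The helper names protectMath and restoreMath and the DEMO_MATH_PH_ prefix are hypothetical, not the production names, and the inline-math pattern is deliberately simplified. Restoration goes through a function replacer so that the "$$" delimiters come back unchanged.

```typescript
// Hypothetical helpers; the DEMO_ prefix is illustrative only.
interface ProtectedMath {
  text: string
  blocks: string[]
}

function protectMath(markdown: string): ProtectedMath {
  const blocks: string[] = []
  const stash = (match: string): string => {
    blocks.push(match)
    return `DEMO_MATH_PH_${blocks.length - 1}_END`
  }
  const text = markdown
    .replace(/\$\$[\s\S]*?\$\$/g, stash) // display math first
    .replace(/\$[^$\n]+\$/g, stash)      // then single-line inline math
  return { text, blocks }
}

function restoreMath(html: string, blocks: string[]): string {
  return html.replace(/DEMO_MATH_PH_(\d+)_END/g, (placeholder, i: string) => {
    const block = blocks[Number(i)]
    if (block === undefined) return placeholder // unknown index: leave as-is
    const escaped = block
      .replace(/&/g, '&amp;')
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;')
    return /^\$\$/.test(block) ? `<div class="math-display">${escaped}</div>` : escaped
  })
}
```

Matching all placeholders with one regex also means a single pass over the HTML, regardless of how many blocks were protected.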

Rich Embeds: Galleries and Link Previews

Beyond text and math, modern content sites need interactive galleries and rich link previews. Rather than requiring authors to write React components, we use a declarative triple-colon syntax that looks like markdown but triggers custom processing.

Note: The code examples below use :::example-* markers to prevent this documentation from being processed by the pipeline.

Gallery syntax:

:::example-gallery
images=[
  {"src": "/images/photo1.jpg", "alt": "Sunset", "caption": "7000ft"},
  {"src": "/images/photo2.jpg", "alt": "Lake"}
]
:::

Build-time validation:

function validateGalleryBlock(block: string) {
  if (!block.includes('images=')) return { isValid: false, error: 'Missing images array' }
  const match = block.match(/images=\[([\s\S]*?)\]/)
  if (!match?.[1]?.trim()) return { isValid: false, error: 'Empty images array' }
  return { isValid: true }
}

Invalid blocks render as red error callouts with helpful debugging information.
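The check above only confirms that the array is non-empty. A stricter variant could also parse the JSON and confirm that every image has a src. The following is a sketch under that assumption; validateGalleryStrict and its error messages are hypothetical names, not part of the pipeline.

```typescript
interface GalleryImage {
  src: string
  alt?: string
  caption?: string
}

type GalleryResult =
  | { isValid: true; images: GalleryImage[] }
  | { isValid: false; error: string }

// Hypothetical stricter validator: parses the array instead of only pattern-matching it.
function validateGalleryStrict(block: string): GalleryResult {
  // Greedy match so a "]" inside a caption does not cut the array short
  const match = block.match(/images\s*=\s*(\[[\s\S]*\])/)
  if (!match) return { isValid: false, error: 'Missing images array' }

  let parsed: unknown
  try {
    parsed = JSON.parse(match[1])
  } catch (err) {
    return { isValid: false, error: `Invalid JSON: ${(err as Error).message}` }
  }
  if (!Array.isArray(parsed) || parsed.length === 0) {
    return { isValid: false, error: 'Empty images array' }
  }
  const bad = parsed.findIndex(
    (img) => typeof img !== 'object' || img === null || typeof (img as GalleryImage).src !== 'string'
  )
  if (bad !== -1) {
    return { isValid: false, error: `Image ${bad + 1} is missing a "src" string` }
  }
  return { isValid: true, images: parsed as GalleryImage[] }
}
```

Returning the parsed images alongside the verdict would let a caller reuse them rather than parse the block a second time.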

Client hydration:

const processExplicitGalleryMarkers = (container: HTMLElement) => {
  const walker = document.createTreeWalker(container, NodeFilter.SHOW_TEXT)

  // Find all :::example-gallery blocks (collect first, then mutate the DOM)
  const blocks: Text[] = []
  let node: Node | null
  while ((node = walker.nextNode())) {
    if (node.textContent?.includes(':::example-gallery')) blocks.push(node as Text)
  }

  blocks.forEach(textNode => {
    const match = textNode.textContent?.match(/:::example-gallery\s*([\s\S]*?)\s*:::/i)
    const json = match?.[1].match(/images\s*=\s*(\[[\s\S]*?\])/i)?.[1]
    if (!json) return

    // Accept single-quoted arrays; note this also rewrites apostrophes inside captions
    let images: { src: string; alt?: string }[]
    try {
      images = JSON.parse(json.replace(/'/g, '"'))
    } catch {
      return  // Malformed JSON: leave the marker text untouched
    }

    const gallery = document.createElement('div')
    gallery.className = `image-gallery image-gallery-${images.length}`

    images.forEach((img, idx) => {
      const el = document.createElement('img')
      el.src = img.src
      el.alt = img.alt || ''
      el.style.cursor = 'pointer'
      el.onclick = () => openLightbox(images, idx)
      gallery.appendChild(el)
    })

    // Replace the wrapping element (usually the <p>) with the gallery
    textNode.parentElement?.replaceWith(gallery)
  })
}

The lightbox provides keyboard navigation (arrows, Escape), scroll locking, and touch-friendly controls.

Link preview syntax:

:::example-link-preview
url="https://example.com"
title="Optional"
description="Optional"
:::

If the block already supplies metadata, the card renders immediately. Otherwise the client fetches it from /api/link-preview (an Open Graph scraper with an image-scoring heuristic).
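The render-or-fetch decision can be sketched as a small parser over the block's key="value" lines. This is an illustration only: parseLinkPreviewBlock and needsFetch are hypothetical names, and treating a supplied title as "metadata provided" is an assumption about where the real pipeline draws the line.

```typescript
interface LinkPreviewAttrs {
  url: string
  title?: string
  description?: string
}

// Hypothetical parser for the key="value" lines inside a link-preview block.
function parseLinkPreviewBlock(block: string): LinkPreviewAttrs | null {
  const attrs: Record<string, string> = {}
  const line = /^\s*(\w+)\s*=\s*"([^"]*)"\s*$/gm
  let m: RegExpExecArray | null
  while ((m = line.exec(block)) !== null) {
    attrs[m[1]] = m[2]
  }
  if (!attrs.url) return null // url is the only required field
  return { url: attrs.url, title: attrs.title, description: attrs.description }
}

// Assumption: a title is enough to render without calling /api/link-preview.
function needsFetch(preview: LinkPreviewAttrs): boolean {
  return !preview.title
}
```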

Client-Side Hydration

The build process generates static HTML for fast page loads, but interactive features (galleries, link previews, math) need client-side JavaScript to become functional. The key is hydrating efficiently—running transformations once and avoiding duplicate processing.

Build time (server):

Parse markdown and validate all syntax
Generate heading IDs for anchor links
Output static HTML with placeholder markers

Render time (client):

Hydrate gallery markers → interactive image grids
Hydrate link preview markers → rich cards (with API fetching)
Typeset all math expressions via MathJax
Initialize floating TOC with active state tracking

Critical: Single-pass DOM walk to avoid double hydration:

const processContent = useCallback((container: HTMLElement) => {
  processExplicitGalleryMarkers(container)
  processExplicitLinkPreviews(container)

  // Auto-detect multi-image paragraphs
  container.querySelectorAll('p').forEach(p => {
    const images = p.querySelectorAll('img')
    if (images.length >= 2) {
      p.classList.add('image-gallery', `image-gallery-${images.length}`)
      // Add handlers...
    }
  })
}, [])

useEffect(() => {
  const el = contentRef.current
  if (!el) return  // Ref not attached yet
  el.innerHTML = html
  processContent(el)  // Single call
}, [html, processContent])

Validation and Error Handling

Rather than failing the build on syntax errors, we render visual error messages in the output. This lets authors see exactly what's wrong and where, while still allowing the site to build and deploy.

Our error-handling strategy:

Visual callouts: Malformed gallery/link-preview blocks render as red error cards with details
Dev warnings: Console warnings for missing or unused citation keys (development only)
Regression tests: Dedicated test pages (test-malformed-blocks.md, test-inline-code-escape.md) catch edge cases
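The generateErrorHTML helper used during gallery protection is not shown in this post. A minimal version might look like the sketch below; the embed-error class name, the optional source argument, and the markup are assumptions for illustration. The important detail is escaping both the message and the offending source, since either may contain author-supplied markup.

```typescript
const escapeHtml = (s: string): string =>
  s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')

// Hypothetical error callout: the message plus, optionally, the offending source block.
function generateErrorHTML(error: string, source = ''): string {
  const details = source ? `<pre>${escapeHtml(source)}</pre>` : ''
  return `<div class="embed-error" role="alert"><strong>Embed error:</strong> ${escapeHtml(error)}${details}</div>`
}
```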

Math Rendering

LaTeX math expressions introduce several challenges when mixed with markdown. Dollar signs, angle brackets, and underscores all have special meanings in markdown, and incorrect escaping or wrapping can break rendering entirely. Our approach: client-side MathJax with careful preprocessing to handle these edge cases.

Challenges (and why they matter):

Delimiter conflicts: $ as currency vs. math delimiter, _ triggers emphasis
HTML escaping: x < y breaks if < becomes &lt; too early
Paragraph wrapping: Display math shouldn't be wrapped in <p> tags
Equation numbering: Need to reset counters per page
Hydration timing: MathJax script might load after content renders

Solution:

Protect with placeholders (shown above), restore with escaping:

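const escaped = block.replace(/</g, '&lt;').replace(/>/g, '&gt;')
const restored = /^\s*\$\$/.test(block) ? `<div class="math-display">${escaped}</div>` : escaped
// Function replacer: a replacement string would collapse "$$" into a single "$"
html = html.replace(placeholder, () => restored)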

Unwrap misplaced paragraphs:

html = html.replace(/<p>\s*<div class="math-display">([\s\S]*?)<\/div>\s*<\/p>/g,
  (_, inner) => `<div class="math-display">${inner}</div>`)
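For the delimiter conflict listed among the challenges ($ as currency), one possible heuristic is the rule documented for Pandoc's tex_math_dollars extension: the opening $ must be followed by a non-space character, and the closing $ must be preceded by a non-space character and not followed by a digit. The sketch below applies that rule; findInlineMath is a hypothetical helper, not the pattern the pipeline actually uses, and it does not handle backslash-escaped dollar signs.

```typescript
// Opening "$" followed by non-space; closing "$" preceded by non-space, not followed by a digit.
const INLINE_MATH = /\$(?=\S)[^$\n]*?[^\s$]\$(?!\d)/g

// Hypothetical helper: returns every inline-math span found in the text.
function findInlineMath(text: string): string[] {
  return text.match(INLINE_MATH) ?? []
}
```

Under this rule a sentence such as "$20,000 and $30,000" is left alone, because the only candidate closing $ is preceded by a space.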

MathJax config:

window.MathJax = {
  tex: {
    inlineMath: [['$', '$']],
    displayMath: [['$$', '$$']],
    packages: { '[+]': ['ams', 'noerrors', 'noundefined'] },
    macros: { argmin: "\\mathop{\\mathrm{argmin}}", argmax: "\\mathop{\\mathrm{argmax}}" },
  }
}

Client-side rendering:

useEffect(() => {
  const el = contentRef.current
  if (!el) return

  window.MathJax?.texReset?.()  // Reset equation numbers per page
  el.innerHTML = html

  const typeset = () => window.MathJax.typesetPromise([el]).catch(console.warn)

  // MathJax already loaded: typeset once and stop
  if (window.MathJax?.typesetPromise) {
    typeset()
    return
  }

  // Otherwise poll every 150ms until the script has loaded, then typeset once
  const interval = setInterval(() => {
    if (window.MathJax?.typesetPromise) {
      clearInterval(interval)
      typeset()
    }
  }, 150)

  return () => clearInterval(interval)
}, [html])

Citation Keys

For technical and academic posts, we needed a lightweight citation system that works in plain markdown. The solution: use citation keys like [bib-key] in the body text, then automatically convert them to numbered references with anchor links and hover previews.

Input:

Recent work [bib-ho20; bib-song21] shows promise.

## References
1. [bib-ho20] Jonathan Ho, et al. "Denoising Diffusion..."
2. [bib-song21] Yang Song, et al. "Score-Based Generative..."

Output:

Recent work [1][2] shows promise.

Implementation (pre-markdown):

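const buildCitationMapAndTransform = (src: string) => {
  // Parse References section → build key-to-number map
  // (elided here; this step also yields bodyText and refText, the text of each numbered reference)
  const citations = new Map<string, number>()

  // Transform citations in body
  const citationRegex = /\[(?:@?bib-[^\]\s,;]+(?:\s*[;,]\s*@?bib-[^\]\s,;]+)*)\]/g
  return bodyText.replace(citationRegex, (match) => {
    // Drop the optional "@" so "@bib-ho20" and "bib-ho20" resolve to the same key
    const keys = match.slice(1, -1).split(/[;,]/).map(s => s.trim().replace(/^@/, ''))
    // Unknown keys map to undefined and are filtered out (numbers start at 1, so Boolean is safe)
    const nums = keys.map(k => citations.get(k)).filter(Boolean)
    return nums.map(n =>
      `<a class="ref" href="#ref-${n}" data-ref-title="${escapeHtml(refText[n])}">[${n}]</a>`
    ).join('')
  })
}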
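The reference-parsing step that the snippet above leaves out could be sketched as follows. It assumes the References section starts at a "## References" heading and that each entry has the form "N. [bib-key] text", as in the input example; parseReferences is a hypothetical name, and reusing the list's own numbers is an assumption.

```typescript
interface ParsedReferences {
  bodyText: string
  citations: Map<string, number>
  refText: Record<number, string>
}

// Hypothetical parser for a "## References" section with "N. [bib-key] text" entries.
function parseReferences(src: string): ParsedReferences {
  const citations = new Map<string, number>()
  const refText: Record<number, string> = {}

  const heading = /^## References\s*$/m.exec(src)
  if (!heading) return { bodyText: src, citations, refText }

  const bodyText = src.slice(0, heading.index)
  const refSection = src.slice(heading.index)

  const entry = /^(\d+)\.\s*\[@?(bib-[^\]\s,;]+)\]\s*(.*)$/gm
  let m: RegExpExecArray | null
  while ((m = entry.exec(refSection)) !== null) {
    const n = Number(m[1])
    citations.set(m[2], n)   // key → reference number
    refText[n] = m[3].trim() // number → text used for the hover preview
  }
  return { bodyText, citations, refText }
}
```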

Dev validation: Console warnings for missing keys or unused references. Front-matter citationKeysMode: "strip" removes citations for drafts.

Inline Code Escaping Bug

An unexpected edge case emerged when documenting the pipeline itself: inline code containing HTML comments (like `<!-- comment -->`) would render as empty <code></code> tags because browsers interpret the comment syntax and hide the content.

The problem: When we write `<!-- comment -->` in markdown, the backticks should protect it, but after HTML generation the browser sees <code><!-- comment --></code> and treats the comment as invisible.

Fix: Protect before markdown processing:

protectedContent = protectedContent.replace(/`([^`]*<!--[^`]*-->[^`]*)`/gi, (match) => {
  protectedBlocks.push(match)
  return `EXAMPLE_INLINE_PH_${protectedBlocks.length - 1}_END`
})

Restore with escaping:

block.replace(/`([^`]+)`/g, (_, content) => 
  `<code>${content.replace(/</g, '&lt;').replace(/>/g, '&gt;')}</code>`
)

Floating Table of Contents

The floating TOC is conditionally rendered to avoid clutter on short posts:

Visibility criteria:

Page has ≥3 headings (substantial structure)
Content is ≥3500 characters (long enough to benefit from navigation)
Viewport is ≥1280px (desktop only; mobile uses native scrolling)
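Expressed as code, the criteria above reduce to a single predicate. This sketch assumes all three conditions must hold at once; shouldShowToc and the constant names are hypothetical.

```typescript
interface TocInputs {
  headingCount: number
  contentLength: number // characters of content
  viewportWidth: number // CSS pixels
}

const TOC_MIN_HEADINGS = 3
const TOC_MIN_CHARS = 3500
const TOC_MIN_VIEWPORT_PX = 1280

// Assumption: the TOC shows only when every criterion is met.
function shouldShowToc({ headingCount, contentLength, viewportWidth }: TocInputs): boolean {
  return (
    headingCount >= TOC_MIN_HEADINGS &&
    contentLength >= TOC_MIN_CHARS &&
    viewportWidth >= TOC_MIN_VIEWPORT_PX
  )
}
```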

Implementation:

Custom remark plugin generates URL-friendly heading IDs at build time
Tocbot library creates the floating TOC, auto-updating active states on scroll
Smooth scroll to sections on click
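The heading-ID step can be illustrated with a small slug function. This is a sketch of the general technique, not the site's remark plugin: createSlugger is a hypothetical name, the rules are ASCII-only (non-ASCII characters are dropped), and the numeric suffix for duplicate headings is an assumption.

```typescript
// Returns a slug function that remembers earlier headings so repeated titles get unique IDs.
function createSlugger(): (heading: string) => string {
  const seen = new Map<string, number>()
  return (heading: string): string => {
    const base =
      heading
        .toLowerCase()
        .trim()
        .replace(/[^a-z0-9\s-]/g, '') // drop punctuation and symbols
        .replace(/\s+/g, '-')         // whitespace runs → single hyphen
        .replace(/-+/g, '-')          // collapse repeated hyphens
        .replace(/^-|-$/g, '') || 'section'
    const count = seen.get(base) ?? 0
    seen.set(base, count + 1)
    return count === 0 ? base : `${base}-${count}`
  }
}
```

Creating one slugger per document keeps the duplicate counters from leaking between posts.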

Performance & Bundle Size

The pipeline is optimized for fast initial loads with progressive enhancement:

Static HTML first: All markdown → HTML at build time (~50-100ms/post), no runtime parsing
Minimal client JS: Only hydrate interactive features (galleries, MathJax, TOC)
External dependencies: MathJax via CDN, Tocbot dynamically imported (not in main bundle)
Image optimization: Next.js built-in optimization for all images
Hydration cost: <500ms for long posts with math + galleries
Progressive enhancement: Content readable immediately; interactivity loads progressively

TODO: Link preview caching (Vercel KV, 7-day TTL to reduce Open Graph fetching)
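The caching TODO is not implemented yet, but the expiry logic can be sketched independently of the storage backend. The example below uses an in-memory Map as a stand-in for Vercel KV; createTtlCache is a hypothetical name, and the injectable clock exists only to make expiry testable.

```typescript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000

// In-memory stand-in for a KV store with per-entry expiry.
function createTtlCache<T>(ttlMs: number, now: () => number = Date.now) {
  const entries = new Map<string, { value: T; expiresAt: number }>()
  return {
    get(key: string): T | undefined {
      const entry = entries.get(key)
      if (!entry) return undefined
      if (now() >= entry.expiresAt) {
        entries.delete(key) // expired: evict and report a miss
        return undefined
      }
      return entry.value
    },
    set(key: string, value: T): void {
      entries.set(key, { value, expiresAt: now() + ttlMs })
    },
  }
}
```

A preview endpoint could consult such a cache, keyed by URL, before scraping Open Graph tags, and store the result with SEVEN_DAYS_MS as the TTL.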

Migration from Ghost

This pipeline emerged from migrating a Ghost blog with hundreds of posts containing galleries and rich embeds. Rather than losing that structured content, we built custom processing to preserve it in plain markdown.

Migration script workflow:

Fetch all posts via Ghost Content API
Convert Mobiledoc (Ghost's JSON format) → plain markdown
Download and optimize all referenced images
Convert Ghost gallery/bookmark cards → :::example-gallery and :::example-link-preview markers
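The last step of that workflow amounts to serializing an image list into the marker syntax. The sketch below shows only that serialization; toGalleryMarker and the MigratedImage shape are hypothetical, and mapping Ghost's actual card payload onto this shape is left out because it is not described here.

```typescript
interface MigratedImage {
  src: string
  alt?: string
  caption?: string
}

// Hypothetical serializer: one JSON object per line, optional fields omitted when empty.
function toGalleryMarker(images: MigratedImage[]): string {
  const rows = images.map((img) => {
    const entry: MigratedImage = { src: img.src }
    if (img.alt) entry.alt = img.alt
    if (img.caption) entry.caption = img.caption
    return '  ' + JSON.stringify(entry)
  })
  return [':::example-gallery', 'images=[', rows.join(',\n'), ']', ':::'].join('\n')
}
```

Using JSON.stringify for each entry keeps quotes and backslashes in captions correctly escaped, so the output stays parseable by the validator and the client.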

The full migration strategy is documented in web/content/docs/2025-05-29-rich-content-preservation-plan.md. The key win: no content loss, no manual reformatting.

Key Lessons Learned

Placeholder pattern is essential: The core technique for protecting fragile content (math, galleries, inline code) through multiple transformation passes without corruption
Author experience > implementation ease: Plain markdown is harder to build but dramatically better for content authors—worth the engineering investment
Fail visibly, not silently: Red error callouts in the UI beat silent failures or build breaks; authors can fix issues quickly
Test edge cases religiously: Dedicated test pages (test-malformed-blocks.md, test-inline-code-escape.md, test-math.md) catch regressions early
Single-pass hydration matters: Running DOM transformations once prevents race conditions, duplicate content, and hydration mismatches
Dev tooling pays dividends: Console warnings for missing citations, TypeScript types for all content utilities, and visual error feedback create a tight feedback loop

Future Extensions

The placeholder pattern is highly extensible—adding new content types follows the same workflow: extract with regex, validate, protect during processing, restore with proper escaping.

Planned additions:

Link preview caching: Vercel KV store with 7-day TTL to avoid repeated Open Graph fetching
Data visualizations: :::chart blocks connected to Supabase for health/habit data
Collapsible callouts: Expandable sections for long footnotes or technical details
Video embeds: Responsive wrappers for YouTube/Vimeo with lazy loading
Code diff blocks: Side-by-side diffs for refactoring walkthroughs
Inline footnotes: Hover-triggered annotations without jumping to references

Each new feature requires three pieces: extraction regex, validation function, and restoration logic. The existing infrastructure handles everything else.
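Those three pieces map naturally onto a small interface. The following is a sketch of how such a handler might plug into a generic protect/restore pass; EmbedHandler, protectEmbeds, restoreEmbeds, the DEMO_EMBED_PH_ prefix, and the example-video handler are all hypothetical and not taken from the existing infrastructure.

```typescript
interface EmbedHandler {
  name: string
  pattern: RegExp                            // extraction regex (global flag required)
  validate: (block: string) => string | null // error message, or null when valid
  restore: (block: string) => string         // final HTML for a valid block
}

interface StashedEmbed {
  handler: EmbedHandler
  block: string
  error: string | null
}

function protectEmbeds(markdown: string, handlers: EmbedHandler[]) {
  const stash: StashedEmbed[] = []
  let text = markdown
  for (const handler of handlers) {
    text = text.replace(handler.pattern, (block) => {
      stash.push({ handler, block, error: handler.validate(block) })
      return `DEMO_EMBED_PH_${stash.length - 1}_END`
    })
  }
  return { text, stash }
}

function restoreEmbeds(html: string, stash: StashedEmbed[]): string {
  return html.replace(/DEMO_EMBED_PH_(\d+)_END/g, (placeholder, i: string) => {
    const item = stash[Number(i)]
    if (!item) return placeholder
    return item.error
      ? `<div class="embed-error">${item.handler.name}: ${item.error}</div>`
      : item.handler.restore(item.block)
  })
}

// Hypothetical handler for the planned video embed.
const videoHandler: EmbedHandler = {
  name: 'example-video',
  pattern: /:::example-video[\s\S]*?:::/g,
  validate: (block) => (/url="https:\/\/[^"]+"/.test(block) ? null : 'Missing https url'),
  restore: (block) => {
    const url = /url="([^"]+)"/.exec(block)
    return `<div class="video-embed" data-url="${url ? url[1] : ''}"></div>`
  },
}
```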

Conclusion

Building a custom markdown pipeline requires more upfront work than adopting MDX, but the payoff is substantial: writers get a simple, portable authoring experience while the site maintains rich interactive features.

What this approach provides:

Simple authoring: Pure markdown that any writer can use
Reliable builds: Deterministic output, no environment-dependent rendering
Clean diffs: Code reviews focus on content, not component logic
Extensibility: New features plug into the existing placeholder system

Critical architectural decisions:

Placeholder protection: Core technique for preserving fragile content through transformations
Build-time validation: Catch errors early with visual feedback
Client-side hydration: Fast initial loads, progressive interactivity
Single-pass processing: Avoid race conditions and duplicate content
Comprehensive testing: Test pages prevent regressions

Our recommendation: For content-heavy sites (blogs, documentation, marketing), plain markdown + custom processing beats MDX. You get better authoring UX, cleaner content portability, and deterministic builds. Only reach for MDX if you need deeply interactive components (calculators, dashboards, complex forms) embedded directly in content.

Test Suites

For extensive testing and live examples of all pipeline features, see our comprehensive test pages:

Test: Math Rendering - Complex LaTeX equations, machine learning formulas, Maxwell's equations, custom macros, and edge cases
Test: Image Galleries - Auto-detected galleries (1-18+ images), explicit gallery markers, rich captions, and layout testing
Test: Code Blocks and Syntax Highlighting - Syntax highlighting across 10+ programming languages with realistic examples
Test: Malformed Gallery and Link-Preview Blocks - Error handling validation with visual error callouts for debugging
Test: Inline Code Escaping - Validates that backticked HTML comments and tags render literally as code
Test: Basic Markdown Features - Testing core markdown syntax including headers, lists, tables, links, and formatting
Test: Iframe Embedding - Testing iframe embedding for videos, maps, and other external content
Test: URL Preview Cards - Testing link preview card functionality for external URLs
Test: Miscellaneous Features - Additional testing for various features and edge cases

These test suites serve as both regression testing and showcase pages for the pipeline's full capabilities.