Building a Custom Markdown Pipeline with Rich Embeds


How we built a custom markdown pipeline that handles LaTeX math, image galleries, and rich embeds while keeping content in plain .md files—no MDX required.


This post documents the custom markdown processing pipeline that powers this site. It handles LaTeX math, interactive galleries, link previews, and academic citations while maintaining plain .md files for portability and simple authoring. If you're building a content site and weighing MDX vs. custom markdown processing, this deep-dive covers the architecture, challenges, and trade-offs.

Why Plain Markdown Over MDX

MDX is powerful, but for content-heavy sites, it introduces unnecessary complexity. After evaluating both approaches for this site, we chose plain markdown with custom processing for several practical reasons.

MDX's downsides for content authoring:

  • Build fragility: A single syntax error breaks the entire build, blocking publication
  • Nondeterminism: Dynamic components can produce different output across environments
  • Collaboration barrier: Writers need React knowledge to add components
  • Noisy diffs: JSX mixed with prose makes content reviews difficult
  • Tooling overhead: Editors, linters, and preview tools all need MDX-specific support

Plain markdown's advantages:

  • Deterministic builds: Same input → same output, every time
  • Universal authoring: Anyone who knows markdown can contribute
  • Clean diffs: Code review focuses on content changes, not component logic
  • Portability: Content works with any markdown processor (no vendor lock-in)

The key insight: we only need a small set of rich embeds (galleries, link previews, math), which can be expressed declaratively without full React component power. Custom preprocessing handles these special cases while keeping 95% of the content as pure markdown.

Architecture

The pipeline splits work between build-time (static HTML generation) and runtime (client-side interactivity). Here's the complete flow:

BUILD TIME (Server):
Markdown file
  ↓ Front-matter extraction
  ↓ Citation pre-processing
  ↓ Placeholder protection
  ↓ Unified: remark-parse → remark-gfm → remark-heading-ids → 
     remark-rehype → rehype-highlight → rehype-stringify
  ↓ Placeholder restoration
  ↓ Post-processing
  ↓ Static HTML output
  
RUNTIME (Client):
  ↓ Parse and hydrate galleries (JSON parse, event handlers)
  ↓ Fetch and render link previews (API calls, skeleton → card)
  ↓ Typeset math (MathJax)
  ↓ Initialize TOC (Tocbot)

Build-time processing generates static HTML for fast initial loads. Client-side hydration adds interactivity without blocking render. This split provides:

  • Fast page loads (static HTML)
  • SEO-friendly content (crawlers see everything)
  • Progressive enhancement (rich features load incrementally)

Three key build-time stages:

  1. Pre-processing: Protect special content with placeholders
  2. Custom plugins: Heading IDs and citations
  3. Post-processing: Restore content with proper escaping

The Placeholder System

The core challenge in processing markdown with rich content is protecting fragile syntax from being mangled by the parser. LaTeX expressions, HTML comments, and structured data blocks all contain characters that markdown processors interpret as special syntax.

The problem:

Example (raw markdown):

$$f(x) = \sum_{i=1}^{n} x_i < \alpha$$

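Example (rendered):

$f(x) = \sum_{i=1}^{n} x_i < \alpha$

Without protection, the markdown processor mangles this expression in three ways:

  • < is interpreted as the start of an HTML tag
  • Special characters are escaped incorrectly
  • The expression is wrapped in <p> tags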

Solution: Extract → Store → Replace → Process → Restore

Note: Snippets use EXAMPLE_* placeholder names so this article doesn't match the real restoration regexes. The production pipeline prefixes placeholders with tokens like MATH_PLACEHOLDER_*_END.

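// `markdown` is assumed to hold the raw post body after front-matter extraction
let protectedContent: string = markdown
const protectedBlocks: string[] = []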

// Protect display math
protectedContent = protectedContent.replace(/\$\$([\s\S]*?)\$\$/g, (match) => {
  protectedBlocks.push(match)
  return `EXAMPLE_MATH_PH_${protectedBlocks.length - 1}_END`
})

// Protect inline math
protectedContent = protectedContent.replace(/(?<!\$)\$(?!\$)(.*?)\$/g, (match) => {
  protectedBlocks.push(match)
  return `EXAMPLE_MATH_PH_${protectedBlocks.length - 1}_END`
})

// Protect galleries (with validation)
protectedContent = protectedContent.replace(/:::example-gallery[\s\S]*?:::/gi, (match) => {
  const validation = validateGalleryBlock(match.trim())
  protectedBlocks.push(validation.isValid ? match : generateErrorHTML(validation.error))
  return `EXAMPLE_GALLERY_PH_${protectedBlocks.length - 1}_END`
})

// Process through Unified...

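// Restore with proper escaping (`html` is the output of the Unified step)
protectedBlocks.forEach((block, index) => {
  const placeholder = `EXAMPLE_MATH_PH_${index}_END`
  if (html.includes(placeholder)) {
    const escaped = block.replace(/</g, '&lt;').replace(/>/g, '&gt;')
    const restored = /^\s*\$\$/.test(block) ? `<div class="math-display">${escaped}</div>` : escaped
    // Pass a function: in a replacement string, `$$` would be collapsed to a single `$`
    html = html.replace(placeholder, () => restored)
  }
})
// Gallery placeholders (EXAMPLE_GALLERY_PH_*) are restored in a similar pass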

Placeholder strings are unique, made only of letters, digits, and underscores (so they survive markdown processing intact), and indexed (to support multiple instances).
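To see the whole round trip in one place, here is a minimal, self-contained sketch of the protect and restore steps for display math only. The names `protectMath` and `restoreMath` and the `DEMO_MATH_PH_` prefix are illustrative assumptions, not the production identifiers.

```typescript
// Hypothetical helper names; the DEMO_ prefix is an assumption for this sketch.
interface ProtectedMarkdown {
  text: string      // markdown with math swapped for placeholders
  blocks: string[]  // original math blocks, indexed by placeholder number
}

function protectMath(markdown: string): ProtectedMarkdown {
  const blocks: string[] = []
  const text = markdown.replace(/\$\$[\s\S]*?\$\$/g, (match) => {
    blocks.push(match)
    return `DEMO_MATH_PH_${blocks.length - 1}_END`
  })
  return { text, blocks }
}

function restoreMath(html: string, blocks: string[]): string {
  return html.replace(/DEMO_MATH_PH_(\d+)_END/g, (placeholder, index: string) => {
    const block = blocks[Number(index)]
    if (block === undefined) return placeholder // unknown index: leave untouched
    // Escape angle brackets so the browser does not parse `<` as a tag
    const escaped = block.replace(/</g, '&lt;').replace(/>/g, '&gt;')
    // Returning the value from a function keeps `$$` literal; the same text
    // passed as a replacement *string* would have `$$` collapsed to `$`
    return `<div class="math-display">${escaped}</div>`
  })
}
```

In use, the text returned by `protectMath` would go through the markdown processor, and the resulting HTML would go through `restoreMath` with the saved blocks. Restoring with one regex pass, rather than one `includes` check per block, is a design choice of this sketch.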

Beyond text and math, modern content sites need interactive galleries and rich link previews. Rather than requiring authors to write React components, we use a declarative triple-colon syntax that looks like markdown but triggers custom processing.

Note: The code examples below use :::example-* markers to prevent this documentation from being processed by the pipeline.

Gallery syntax:

:::example-gallery
images=[
  {"src": "/images/photo1.jpg", "alt": "Sunset", "caption": "7000ft"},
  {"src": "/images/photo2.jpg", "alt": "Lake"}
]
:::

Build-time validation:

function validateGalleryBlock(block: string) {
  if (!block.includes('images=')) return { isValid: false, error: 'Missing images array' }
  const match = block.match(/images=\[([\s\S]*?)\]/)
  if (!match?.[1]?.trim()) return { isValid: false, error: 'Empty images array' }
  return { isValid: true }
}

Invalid blocks render as red error callouts with helpful debugging information.
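The check above only confirms that a non-empty images array is present. A stricter variant could also parse the JSON and check each entry. The sketch below assumes the block format shown earlier; `validateGalleryStrict`, `renderErrorCallout`, and the `embed-error` class name are hypothetical, since the post does not show the real `generateErrorHTML`.

```typescript
type ValidationResult = { isValid: true } | { isValid: false; error: string }

function validateGalleryStrict(block: string): ValidationResult {
  // Greedy match so the capture runs to the last `]` in the block
  const match = block.match(/images\s*=\s*(\[[\s\S]*\])/)
  if (!match) return { isValid: false, error: 'Missing images array' }

  let parsed: unknown
  try {
    parsed = JSON.parse(match[1])
  } catch (err) {
    return { isValid: false, error: `Invalid JSON in images array: ${(err as Error).message}` }
  }

  if (!Array.isArray(parsed) || parsed.length === 0) {
    return { isValid: false, error: 'Empty images array' }
  }
  // Every entry must be an object with a string `src`
  const bad = parsed.findIndex(
    (img) => typeof img !== 'object' || img === null || typeof (img as { src?: unknown }).src !== 'string'
  )
  if (bad !== -1) return { isValid: false, error: `Image ${bad + 1} is missing a "src" string` }
  return { isValid: true }
}

// Hypothetical error card; escapes the message before embedding it in HTML
function renderErrorCallout(error: string): string {
  const safe = error.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
  return `<div class="embed-error" role="alert">Gallery error: ${safe}</div>`
}
```

Catching malformed JSON at build time means the client-side `JSON.parse` only ever sees blocks that already passed this check.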

Client hydration:

const processExplicitGalleryMarkers = (container: HTMLElement) => {
  const walker = document.createTreeWalker(container, NodeFilter.SHOW_TEXT)

  // Collect all text nodes containing :::example-gallery before mutating the DOM
  const blocks: Text[] = []
  let node: Node | null
  while ((node = walker.nextNode())) {
    if (node.textContent?.includes(':::example-gallery')) blocks.push(node as Text)
  }

  blocks.forEach(textNode => {
    const match = textNode.textContent?.match(/:::example-gallery\s*([\s\S]*?)\s*:::/i)
    const json = match?.[1].match(/images\s*=\s*(\[[\s\S]*?\])/i)?.[1]
    if (!json) return

    let images: { src: string; alt?: string }[]
    try {
      images = JSON.parse(json)
    } catch {
      try {
        // Fall back to normalizing single quotes, for hand-written pseudo-JSON
        images = JSON.parse(json.replace(/'/g, '"'))
      } catch {
        return // Malformed block: leave the marker text in place
      }
    }

    const gallery = document.createElement('div')
    gallery.className = `image-gallery image-gallery-${images.length}`

    images.forEach((img, idx) => {
      const el = document.createElement('img')
      el.src = img.src
      el.alt = img.alt || ''
      el.style.cursor = 'pointer'
      el.onclick = () => openLightbox(images, idx)
      gallery.appendChild(el)
    })

    // Replace the wrapping element (typically a <p>) with the gallery
    textNode.parentElement?.parentNode?.replaceChild(gallery, textNode.parentElement)
  })
}

The lightbox provides keyboard navigation (arrows, Escape), scroll locking, and touch-friendly controls.

Link preview syntax:

:::example-link-preview
url="https://example.com"
title="Optional"
description="Optional"
:::

If metadata is provided in the block, the card renders immediately. Otherwise the client fetches it from /api/link-preview (an Open Graph scraper with an image-scoring heuristic).
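As a rough illustration of that decision, the sketch below parses the key="value" lines of a link-preview block and reports whether a fetch is needed. `parseLinkPreview` and `needsFetch` are hypothetical names, and treating a title alone as sufficient metadata is an assumption of this sketch.

```typescript
interface LinkPreviewAttrs {
  url: string
  title?: string
  description?: string
}

// Parse key="value" lines from a :::example-link-preview block.
// Returns null when the required url attribute is absent.
function parseLinkPreview(block: string): LinkPreviewAttrs | null {
  const attrs: Record<string, string> = {}
  const line = /^\s*(\w+)\s*=\s*"([^"]*)"\s*$/gm
  let m: RegExpExecArray | null
  while ((m = line.exec(block)) !== null) {
    attrs[m[1]] = m[2]
  }
  if (!attrs.url) return null
  return { url: attrs.url, title: attrs.title, description: attrs.description }
}

// Assumption: an inline title is enough to render without a network request.
function needsFetch(attrs: LinkPreviewAttrs): boolean {
  return !attrs.title
}
```

A block with only `url` set would go through the fetch path, while the example above (with `title` and `description`) would render straight away.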

Client-Side Hydration

The build process generates static HTML for fast page loads, but interactive features (galleries, link previews, math) need client-side JavaScript to become functional. The key is hydrating efficiently—running transformations once and avoiding duplicate processing.

Build time (server):

  • Parse markdown and validate all syntax
  • Generate heading IDs for anchor links
  • Output static HTML with placeholder markers

Render time (client):

  • Hydrate gallery markers → interactive image grids
  • Hydrate link preview markers → rich cards (with API fetching)
  • Typeset all math expressions via MathJax
  • Initialize floating TOC with active state tracking

Critical: Single-pass DOM walk to avoid double hydration:

const processContent = useCallback((container: HTMLElement) => {
  processExplicitGalleryMarkers(container)
  processExplicitLinkPreviews(container)

  // Auto-detect multi-image paragraphs
  container.querySelectorAll('p').forEach(p => {
    const images = p.querySelectorAll('img')
    if (images.length >= 2) {
      p.classList.add('image-gallery', `image-gallery-${images.length}`)
      // Add handlers...
    }
  })
}, [])

useEffect(() => {
  const container = contentRef.current
  if (!container) return  // Ref not attached yet
  container.innerHTML = html
  processContent(container)  // Single call
}, [html, processContent])
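If `processContent` can be reached from more than one code path, one way to keep hydration single-pass is to remember which HTML string each container was last hydrated with. This is a sketch under that assumption; `hydrateOnce` is a hypothetical helper, not part of the pipeline described here.

```typescript
// Hypothetical guard: remembers the HTML each container was last hydrated with.
// A WeakMap lets entries be garbage-collected along with their containers.
const lastHydratedHtml = new WeakMap<object, string>()

function hydrateOnce<T extends object>(
  container: T,
  html: string,
  hydrate: (container: T) => void
): boolean {
  if (lastHydratedHtml.get(container) === html) return false // same content: skip
  lastHydratedHtml.set(container, html)
  hydrate(container)
  return true
}
```

Keying on the HTML string rather than on the container alone means a navigation to a new post still hydrates, while a repeated call for the same content is skipped.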

Validation and Error Handling

Rather than failing the build on syntax errors, we render visual error messages in the output. This lets authors see exactly what's wrong and where, while still allowing the site to build and deploy.

Our error-handling strategy:

  • Visual callouts: Malformed gallery/link-preview blocks render as red error cards with details
  • Dev warnings: Console warnings for missing or unused citation keys (development only)
  • Regression tests: Dedicated test pages (test-malformed-blocks.md, test-inline-code-escape.md) catch edge cases

Math Rendering

LaTeX math expressions introduce several challenges when mixed with markdown. Underscores and angle brackets have special meanings in markdown and HTML, dollar signs double as currency symbols, and incorrect escaping or wrapping can break rendering entirely. Our approach: client-side MathJax with careful preprocessing to handle these edge cases.

Challenges (and why they matter):

  1. Delimiter conflicts: $ as currency vs. math delimiter, _ triggers emphasis
  2. HTML escaping: x < y breaks if < becomes &lt; too early
  3. Paragraph wrapping: Display math shouldn't be wrapped in <p> tags
  4. Equation numbering: Need to reset counters per page
  5. Hydration timing: MathJax script might load after content renders

Solution:

Protect with placeholders (shown above), restore with escaping:

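const escaped = block.replace(/</g, '&lt;').replace(/>/g, '&gt;')
const restored = /^\s*\$\$/.test(block) ? `<div class="math-display">${escaped}</div>` : escaped
// Replacer function keeps `$$` literal (a replacement string would turn it into `$`)
html = html.replace(placeholder, () => restored)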

Unwrap misplaced paragraphs:

html = html.replace(/<p>\s*<div class="math-display">([\s\S]*?)<\/div>\s*<\/p>/g,
  (_, inner) => `<div class="math-display">${inner}</div>`)

MathJax config:

window.MathJax = {
  tex: {
    inlineMath: [['$', '$']],
    displayMath: [['$$', '$$']],
    packages: { '[+]': ['ams', 'noerrors', 'noundefined'] },
    macros: { argmin: "\\mathop{\\mathrm{argmin}}", argmax: "\\mathop{\\mathrm{argmax}}" },
    tags: 'ams'  // Number equations AMS-style
  }
}

Client-side rendering:

useEffect(() => {
  const el = contentRef.current
  if (!el) return

  window.MathJax?.texReset?.()  // Reset equation numbers per page
  el.innerHTML = html

  const typeset = () => {
    window.MathJax?.typesetPromise?.([el]).catch(console.warn)
  }

  // MathJax already loaded: typeset once, no polling needed
  if (window.MathJax?.typesetPromise) {
    typeset()
    return
  }

  // Otherwise poll every 150ms until the script has loaded, then typeset once
  const interval = setInterval(() => {
    if (window.MathJax?.typesetPromise) {
      clearInterval(interval)
      typeset()
    }
  }, 150)

  return () => clearInterval(interval)
}, [html])

Citation Keys

For technical and academic posts, we needed a lightweight citation system that works in plain markdown. The solution: use citation keys like [bib-key] in the body text, then automatically convert them to numbered references with anchor links and hover previews.

Input:

Recent work [bib-ho20; bib-song21] shows promise.

## References
1. [bib-ho20] Jonathan Ho, et al. "Denoising Diffusion..."
2. [bib-song21] Yang Song, et al. "Score-Based Generative..."

Output:

Recent work [1][2] shows promise.

Implementation (pre-markdown):

const buildCitationMapAndTransform = (src: string) => {
  // Parse References section → build key-to-number map.
  // `bodyText` and `refText` used below come from this (elided) parsing step.
  const citations = new Map<string, number>()

  // Transform citations in body
  const citationRegex = /\[(?:@?bib-[^\]\s,;]+(?:\s*[;,]\s*@?bib-[^\]\s,;]+)*)\]/g
  return bodyText.replace(citationRegex, (match) => {
    // Strip the optional leading "@" so "@bib-ho20" and "bib-ho20" hit the same map entry
    const keys = match.slice(1, -1).split(/[;,]/).map(s => s.trim().replace(/^@/, ''))
    const nums = keys.map(k => citations.get(k)).filter((n): n is number => n !== undefined)
    return nums.map(n =>
      `<a class="ref" href="#ref-${n}" data-ref-title="${escapeHtml(refText[n])}">[${n}]</a>`
    ).join('')
  })
}

Dev validation: Console warnings for missing keys or unused references. Setting citationKeysMode: "strip" in the front matter removes citations for drafts.
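The reference-parsing and dev-validation steps are not shown above. The sketch below is one way they might look, assuming reference entries follow the `1. [bib-key] text` shape from the input example. `parseReferences`, `checkCitations`, and the `CitationIndex` type are hypothetical names.

```typescript
interface CitationIndex {
  numbers: Map<string, number>  // bib key → reference number
  texts: Map<number, string>    // reference number → reference text
}

// Assumes entries look like `1. [bib-key] Reference text`, one per line.
function parseReferences(referencesSection: string): CitationIndex {
  const numbers = new Map<string, number>()
  const texts = new Map<number, string>()
  const entry = /^\s*(\d+)\.\s*\[(bib-[^\]\s]+)\]\s*(.*)$/gm
  let m: RegExpExecArray | null
  while ((m = entry.exec(referencesSection)) !== null) {
    const n = Number(m[1])
    numbers.set(m[2], n)
    texts.set(n, m[3].trim())
  }
  return { numbers, texts }
}

// Report keys cited in the body but not defined, and references never cited.
function checkCitations(body: string, index: CitationIndex) {
  const cited = new Set<string>()
  const group = /\[((?:@?bib-[^\]\s,;]+\s*[;,]?\s*)+)\]/g
  let m: RegExpExecArray | null
  while ((m = group.exec(body)) !== null) {
    for (const raw of m[1].split(/[;,]/)) {
      const key = raw.trim().replace(/^@/, '')
      if (key) cited.add(key)
    }
  }
  const missing = Array.from(cited).filter((k) => !index.numbers.has(k))
  const unused = Array.from(index.numbers.keys()).filter((k) => !cited.has(k))
  return { missing, unused }
}
```

In development, the `missing` and `unused` arrays could be passed to `console.warn`; in production the check could be skipped entirely.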

Inline Code Escaping Bug

An unexpected edge case emerged when documenting the pipeline itself: inline code containing HTML comments (like <!-- link-preview -->) would render as empty <code></code> tags because browsers interpret the comment syntax and hide the content.

The problem: When we write `<!-- example-link-card -->` in markdown, the backticks should protect it, but after HTML generation, the browser sees <code><!-- ... --></code> and treats the comment as invisible.

Fix: Protect before markdown processing:

protectedContent = protectedContent.replace(/`([^`]*<!--[^`]*-->[^`]*)`/gi, (match) => {
  protectedBlocks.push(match)
  return `EXAMPLE_INLINE_PH_${protectedBlocks.length - 1}_END`
})

Restore with escaping:

block.replace(/`([^`]+)`/g, (_, content) => 
  `<code>${content.replace(/</g, '&lt;').replace(/>/g, '&gt;')}</code>`
)

Table of Contents

The floating TOC is conditionally rendered to avoid clutter on short posts:

Visibility criteria:

  • Page has ≥3 headings (substantial structure)
  • Content is ≥3500 characters (long enough to benefit from navigation)
  • Viewport is ≥1280px (desktop only; mobile uses native scrolling)

Implementation:

  • Custom remark plugin generates URL-friendly heading IDs at build time
  • Tocbot library creates the floating TOC, auto-updating active states on scroll
  • Smooth scroll to sections on click
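The visibility rule and heading IDs can be sketched as two small pure functions. The thresholds come from the criteria above; `shouldShowToc` and `slugifyHeading` are hypothetical names, and the slug rules are a simplification that may differ from the production remark plugin (for example in how duplicates or non-ASCII headings are handled).

```typescript
interface TocInput {
  headingCount: number
  contentLength: number  // characters
  viewportWidth: number  // CSS pixels
}

// Thresholds taken from the visibility criteria listed above.
function shouldShowToc({ headingCount, contentLength, viewportWidth }: TocInput): boolean {
  return headingCount >= 3 && contentLength >= 3500 && viewportWidth >= 1280
}

// Simplified slug: lowercase, drop punctuation, hyphenate whitespace.
function slugifyHeading(text: string): string {
  return text
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9\s-]/g, '') // remove anything that is not a letter, digit, space, or hyphen
    .replace(/\s+/g, '-')         // whitespace runs become a single hyphen
    .replace(/-+/g, '-')          // collapse repeated hyphens
}
```

Keeping the rule in a pure function makes the thresholds easy to unit test without a browser.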

Performance & Bundle Size

The pipeline is optimized for fast initial loads with progressive enhancement:

  • Static HTML first: All markdown → HTML at build time (~50-100ms/post), no runtime parsing
  • Minimal client JS: Only hydrate interactive features (galleries, MathJax, TOC)
  • External dependencies: MathJax via CDN, Tocbot dynamically imported (not in main bundle)
  • Image optimization: Next.js built-in optimization for all images
  • Hydration cost: <500ms for long posts with math + galleries
  • Progressive enhancement: Content readable immediately; interactivity loads progressively

TODO: Link preview caching (Vercel KV, 7-day TTL to reduce Open Graph fetching)

Migration from Ghost

This pipeline emerged from migrating a Ghost blog with hundreds of posts containing galleries and rich embeds. Rather than losing that structured content, we built custom processing to preserve it in plain markdown.

Migration script workflow:

  1. Fetch all posts via Ghost Content API
  2. Convert Mobiledoc (Ghost's JSON format) → plain markdown
  3. Download and optimize all referenced images
  4. Convert Ghost gallery/bookmark cards → :::example-gallery and :::example-link-preview markers
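Step 4 could look roughly like the following. The `GhostCard` type is a deliberately simplified stand-in for Ghost's card payloads (real payloads carry more fields and a different nesting), and `cardToMarker` is a hypothetical name, so treat this as a sketch of the idea rather than the migration script itself.

```typescript
// Simplified, assumed shapes; not the actual Mobiledoc card format.
type GhostCard =
  | { type: 'gallery'; images: { src: string; alt?: string; caption?: string }[] }
  | { type: 'bookmark'; url: string; title?: string; description?: string }

function cardToMarker(card: GhostCard): string {
  if (card.type === 'gallery') {
    // Keep only the fields the gallery syntax uses; JSON.stringify drops undefined ones
    const images = card.images.map(({ src, alt, caption }) => ({ src, alt, caption }))
    return `:::example-gallery\nimages=${JSON.stringify(images, null, 2)}\n:::`
  }
  // Attribute values are double-quoted, so swap inner double quotes for single quotes
  const quote = (value: string) => `"${value.replace(/"/g, "'")}"`
  const lines = [`url=${quote(card.url)}`]
  if (card.title) lines.push(`title=${quote(card.title)}`)
  if (card.description) lines.push(`description=${quote(card.description)}`)
  return [':::example-link-preview', ...lines, ':::'].join('\n')
}
```

Emitting real JSON for the images array keeps the output compatible with the `JSON.parse` call in the client hydration code.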

The full migration strategy is documented in web/content/docs/2025-05-29-rich-content-preservation-plan.md. The key win: no content loss, no manual reformatting.

Key Lessons Learned

  1. Placeholder pattern is essential: The core technique for protecting fragile content (math, galleries, inline code) through multiple transformation passes without corruption
  2. Author experience > implementation ease: A plain-markdown pipeline is harder to build but dramatically better for content authors, which makes it worth the engineering investment
  3. Fail visibly, not silently: Red error callouts in the UI beat silent failures or build breaks; authors can fix issues quickly
  4. Test edge cases religiously: Dedicated test pages (test-malformed-blocks.md, test-inline-code-escape.md, test-math.md) catch regressions early
  5. Single-pass hydration matters: Running DOM transformations once prevents race conditions, duplicate content, and hydration mismatches
  6. Dev tooling pays dividends: Console warnings for missing citations, TypeScript types for all content utilities, and visual error feedback create a tight feedback loop

Future Extensions

The placeholder pattern is highly extensible—adding new content types follows the same workflow: extract with regex, validate, protect during processing, restore with proper escaping.

Planned additions:

  • Link preview caching: Vercel KV store with 7-day TTL to avoid repeated Open Graph fetching
  • Data visualizations: :::chart blocks connected to Supabase for health/habit data
  • Collapsible callouts: Expandable sections for long footnotes or technical details
  • Video embeds: Responsive wrappers for YouTube/Vimeo with lazy loading
  • Code diff blocks: Side-by-side diffs for refactoring walkthroughs
  • Inline footnotes: Hover-triggered annotations without jumping to references

Each new feature requires three pieces: extraction regex, validation function, and restoration logic. The existing infrastructure handles everything else.
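One way to picture those three pieces is as a small interface that each embed type implements. Everything named here (`EmbedHandler`, `videoEmbed`, `runHandler`, the `:::example-video` marker, the `DEMO_` placeholder prefix, and the CSS class names) is hypothetical and only illustrates the shape of the extension point.

```typescript
// Hypothetical interface: the three pieces a new embed type supplies.
interface EmbedHandler {
  name: string
  pattern: RegExp                         // extraction regex (must be global)
  validate(block: string): string | null  // error message, or null if valid
  restore(block: string): string          // final HTML for a valid block
}

const videoEmbed: EmbedHandler = {
  name: 'video',
  pattern: /:::example-video[\s\S]*?:::/g,
  validate: (block) => (/url="https:\/\/[^"]+"/.test(block) ? null : 'Missing https url'),
  restore: (block) => {
    const url = block.match(/url="([^"]+)"/)![1] // safe: validate() already passed
    return `<div class="video-embed" data-url="${url}"></div>`
  },
}

// Protect → render → restore for a single handler.
// `render` stands in for the markdown-to-HTML step.
function runHandler(markdown: string, handler: EmbedHandler, render: (md: string) => string): string {
  const blocks: string[] = []
  const prefix = `DEMO_${handler.name.toUpperCase()}_PH_`
  const protectedMd = markdown.replace(handler.pattern, (match) => {
    blocks.push(match)
    return `${prefix}${blocks.length - 1}_END`
  })
  const html = render(protectedMd)
  return html.replace(new RegExp(`${prefix}(\\d+)_END`, 'g'), (_, i: string) => {
    const block = blocks[Number(i)]
    const error = handler.validate(block)
    return error ? `<div class="embed-error">${error}</div>` : handler.restore(block)
  })
}
```

With this shape, adding a new embed type means writing one object literal; the protect and restore plumbing stays shared.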

Conclusion

Building a custom markdown pipeline requires more upfront work than adopting MDX, but the payoff is substantial: writers get a simple, portable authoring experience while the site maintains rich interactive features.

What this approach provides:

  • Simple authoring: Pure markdown that any writer can use
  • Reliable builds: Deterministic output, no environment-dependent rendering
  • Clean diffs: Code reviews focus on content, not component logic
  • Extensibility: New features plug into the existing placeholder system

Critical architectural decisions:

  1. Placeholder protection: Core technique for preserving fragile content through transformations
  2. Build-time validation: Catch errors early with visual feedback
  3. Client-side hydration: Fast initial loads, progressive interactivity
  4. Single-pass processing: Avoid race conditions and duplicate content
  5. Comprehensive testing: Test pages prevent regressions

Our recommendation: For content-heavy sites (blogs, documentation, marketing), plain markdown + custom processing beats MDX. You get better authoring UX, cleaner content portability, and deterministic builds. Only reach for MDX if you need deeply interactive components (calculators, dashboards, complex forms) embedded directly in content.

Test Suites

For extensive testing and live examples of all pipeline features, see our dedicated test pages. They double as regression tests and as showcase pages for the pipeline's full capabilities.


Copyright 2025, Ran Ding