This post documents the custom markdown processing pipeline that powers this site. It handles LaTeX math, interactive galleries, link previews, and academic citations while maintaining plain .md files for portability and simple authoring. If you're building a content site and weighing MDX vs. custom markdown processing, this deep-dive covers the architecture, challenges, and trade-offs.
Why Plain Markdown Over MDX
MDX is powerful, but for content-heavy sites, it introduces unnecessary complexity. After evaluating both approaches for this site, we chose plain markdown with custom processing for several practical reasons.
MDX's downsides for content authoring:
- Build fragility: A single syntax error breaks the entire build, blocking publication
- Nondeterminism: Dynamic components can produce different output across environments
- Collaboration barrier: Writers need React knowledge to add components
- Noisy diffs: JSX mixed with prose makes content reviews difficult
- Tooling overhead: Editors, linters, and preview tools all need MDX-specific support
Plain markdown's advantages:
- Deterministic builds: Same input → same output, every time
- Universal authoring: Anyone who knows markdown can contribute
- Clean diffs: Code review focuses on content changes, not component logic
- Portability: Content works with any markdown processor (no vendor lock-in)
The key insight: we only need a small set of rich embeds (galleries, link previews, math), which can be expressed declaratively without full React component power. Custom preprocessing handles these special cases while keeping 95% of the content as pure markdown.
Architecture
The pipeline splits work between build-time (static HTML generation) and runtime (client-side interactivity). Here's the complete flow:
BUILD TIME (Server):
Markdown file
↓ Front-matter extraction
↓ Citation pre-processing
↓ Placeholder protection
↓ Unified: remark-parse → remark-gfm → remark-heading-ids →
remark-rehype → rehype-highlight → rehype-stringify
↓ Placeholder restoration
↓ Post-processing
↓ Static HTML output
RUNTIME (Client):
↓ Parse and hydrate galleries (JSON parse, event handlers)
↓ Fetch and render link previews (API calls, skeleton → card)
↓ Typeset math (MathJax)
↓ Initialize TOC (Tocbot)
Build-time processing generates static HTML for fast initial loads. Client-side hydration adds interactivity without blocking render. This split provides:
- Fast page loads (static HTML)
- SEO-friendly content (crawlers see everything)
- Progressive enhancement (rich features load incrementally)
Three key build-time stages:
- Pre-processing: Protect special content with placeholders
- Custom plugins: Heading IDs and citations
- Post-processing: Restore content with proper escaping
The Placeholder System
The core challenge in processing markdown with rich content is protecting fragile syntax from being mangled by the parser. LaTeX expressions, HTML comments, and structured data blocks all contain characters that markdown processors interpret as special syntax.
The problem:
Example (raw markdown):
$$f(x) = \sum_{i=1}^{n} x_i < \alpha$$
When rendered without protection:
- `<` gets interpreted as an HTML tag
- Characters get escaped incorrectly
- Expression gets wrapped in `<p>` tags
Solution: Extract → Store → Replace → Process → Restore
Note: Snippets use `EXAMPLE_*` placeholder names so this article doesn't match the real restoration regexes. The production pipeline uses placeholder tokens like `MATH_PLACEHOLDER_*_END`.
const protectedBlocks: string[] = []
// Protect display math
protectedContent = protectedContent.replace(/\$\$([\s\S]*?)\$\$/g, (match) => {
  protectedBlocks.push(match)
  return `EXAMPLE_MATH_PH_${protectedBlocks.length - 1}_END`
})
// Protect inline math
protectedContent = protectedContent.replace(/(?<!\$)\$(?!\$)(.*?)\$/g, (match) => {
  protectedBlocks.push(match)
  return `EXAMPLE_MATH_PH_${protectedBlocks.length - 1}_END`
})
// Protect galleries (with validation)
protectedContent = protectedContent.replace(/:::example-gallery[\s\S]*?:::/gi, (match) => {
  const validation = validateGalleryBlock(match.trim())
  protectedBlocks.push(validation.isValid ? match : generateErrorHTML(validation.error))
  return `EXAMPLE_GALLERY_PH_${protectedBlocks.length - 1}_END`
})
// Process through Unified...
// Restore with proper escaping
protectedBlocks.forEach((block, index) => {
  const placeholder = `EXAMPLE_MATH_PH_${index}_END`
  if (html.includes(placeholder)) {
    // Escape angle brackets so the browser does not read them as tags
    const escaped = block.replace(/</g, '&lt;').replace(/>/g, '&gt;')
    // Replacer function: a replacement string would collapse "$$" to "$"
    html = html.replace(placeholder, () =>
      /^\s*\$\$/.test(block) ? `<div class="math-display">${escaped}</div>` : escaped
    )
  }
})
Placeholder strings are unique, built only from letters, digits, and underscores (so they survive markdown processing intact), and indexed (to support multiple instances).
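To make the extract → restore round trip concrete, here is a minimal, self-contained sketch. The helper names (`protectMath`, `restoreMath`, `escapeForHtml`) are hypothetical and the inline-math pattern is simplified, so treat it as an illustration of the technique rather than the production code.

```typescript
// Hypothetical helpers; the placeholder format mirrors the EXAMPLE_* snippets above.
function protectMath(src: string): { text: string; blocks: string[] } {
  const blocks: string[] = []
  const stash = (match: string): string => {
    blocks.push(match)
    return `EXAMPLE_MATH_PH_${blocks.length - 1}_END`
  }
  // Display math goes first, so the inline pattern never sees a "$$" pair.
  const text = src
    .replace(/\$\$[\s\S]*?\$\$/g, stash)
    .replace(/\$[^$\n]+?\$/g, stash)
  return { text, blocks }
}

const escapeForHtml = (s: string): string =>
  s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')

function restoreMath(html: string, blocks: string[]): string {
  return blocks.reduce((out, block, i) => {
    const escaped = escapeForHtml(block)
    const restored = /^\s*\$\$/.test(block)
      ? `<div class="math-display">${escaped}</div>`
      : escaped
    // Replacer function, because "$$" is special inside a replacement string.
    return out.replace(`EXAMPLE_MATH_PH_${i}_END`, () => restored)
  }, html)
}

// Stand-in for the markdown step: wrap the protected text in a paragraph.
const demo = protectMath('Let $a < b$ and $$x_i < \\alpha$$')
const roundTrip = restoreMath(`<p>${demo.text}</p>`, demo.blocks)
```

Between the two calls the markdown processor only ever sees tokens such as `EXAMPLE_MATH_PH_0_END`, which contain nothing it treats as syntax.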
Rich Embeds: Galleries and Link Previews
Beyond text and math, modern content sites need interactive galleries and rich link previews. Rather than requiring authors to write React components, we use a declarative triple-colon syntax that looks like markdown but triggers custom processing.
Note: The code examples below use `:::example-*` markers to prevent this documentation from being processed by the pipeline.
Gallery syntax:
:::example-gallery
images=[
{"src": "/images/photo1.jpg", "alt": "Sunset", "caption": "7000ft"},
{"src": "/images/photo2.jpg", "alt": "Lake"}
]
:::
Build-time validation:
function validateGalleryBlock(block: string) {
  if (!block.includes('images=')) return { isValid: false, error: 'Missing images array' }
  const match = block.match(/images=\[([\s\S]*?)\]/)
  if (!match?.[1]?.trim()) return { isValid: false, error: 'Empty images array' }
  return { isValid: true }
}
Invalid blocks render as red error callouts with helpful debugging information.
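The check above only confirms that an array is present. A stricter variant could also parse the JSON at build time, so malformed entries surface in the error callout instead of failing during client hydration. This is a sketch under assumptions: `parseGalleryBlock`, `GalleryImage`, and the error wording are hypothetical, not part of the pipeline described here.

```typescript
// Hypothetical stricter validator; the function shown above only checks for a non-empty array.
interface GalleryImage { src: string; alt?: string; caption?: string }
type GalleryResult =
  | { isValid: true; images: GalleryImage[] }
  | { isValid: false; error: string }

function parseGalleryBlock(block: string): GalleryResult {
  // Greedy match: runs from the first "[" to the last "]" in the block.
  const match = block.match(/images\s*=\s*(\[[\s\S]*\])/)
  if (!match) return { isValid: false, error: 'Missing images array' }
  let parsed: unknown
  try {
    parsed = JSON.parse(match[1])
  } catch (err) {
    return { isValid: false, error: `Invalid JSON in images array: ${(err as Error).message}` }
  }
  if (!Array.isArray(parsed) || parsed.length === 0) {
    return { isValid: false, error: 'Empty images array' }
  }
  const images: GalleryImage[] = []
  for (let i = 0; i < parsed.length; i++) {
    const item = parsed[i] as Partial<GalleryImage> | null
    if (!item || typeof item.src !== 'string' || item.src === '') {
      return { isValid: false, error: `Image ${i + 1} is missing a "src" string` }
    }
    images.push({ src: item.src, alt: item.alt, caption: item.caption })
  }
  return { isValid: true, images }
}
```

Returning the parsed images alongside `isValid` would also let the build step reuse them instead of parsing the block a second time.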
Client hydration:
const processExplicitGalleryMarkers = (container: HTMLElement) => {
  const walker = document.createTreeWalker(container, NodeFilter.SHOW_TEXT)
  // Find all :::example-gallery blocks
  const blocks: Text[] = []
  let node: Node | null
  while ((node = walker.nextNode())) {
    if (node.textContent?.includes(':::example-gallery')) blocks.push(node as Text)
  }
  blocks.forEach(textNode => {
    const match = textNode.textContent?.match(/:::example-gallery\s*([\s\S]*?)\s*:::/i)
    const json = match?.[1].match(/images\s*=\s*(\[[\s\S]*?\])/i)?.[1]
    if (!json) return
    let images: { src: string; alt?: string }[]
    try {
      // Tolerate single-quoted JSON in hand-written blocks
      images = JSON.parse(json.replace(/'/g, '"'))
    } catch {
      return // Skip blocks whose JSON does not parse
    }
    const gallery = document.createElement('div')
    gallery.className = `image-gallery image-gallery-${images.length}`
    images.forEach((img, idx) => {
      const el = document.createElement('img')
      el.src = img.src
      el.alt = img.alt || ''
      el.style.cursor = 'pointer'
      el.onclick = () => openLightbox(images, idx)
      gallery.appendChild(el)
    })
    // Replace the wrapping element (usually a <p>) with the gallery
    textNode.parentElement?.parentNode?.replaceChild(gallery, textNode.parentElement)
  })
}
Lightbox provides keyboard navigation (arrows, Escape), scroll locking, and touch-friendly controls.
Link preview syntax:
:::example-link-preview
url="https://example.com"
title="Optional"
description="Optional"
:::
If metadata is provided, the card renders immediately. Otherwise the client fetches it from /api/link-preview (an Open Graph scraper with an image-scoring heuristic).
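As a rough illustration of that decision, the sketch below parses the `key="value"` lines of a block and reports whether a fetch is needed. `parseLinkPreviewBlock` and `needsFetch` are hypothetical names, and treating a missing title as the trigger for fetching is an assumption, not documented behavior.

```typescript
// Hypothetical parser for the attribute lines of a :::example-link-preview block.
interface LinkPreviewAttrs { url: string; title?: string; description?: string }

function parseLinkPreviewBlock(block: string): LinkPreviewAttrs | null {
  const attrs: Record<string, string> = {}
  const attrPattern = /^\s*([a-z]+)\s*=\s*"([^"]*)"\s*$/gim
  let m: RegExpExecArray | null
  while ((m = attrPattern.exec(block)) !== null) {
    attrs[m[1].toLowerCase()] = m[2]
  }
  if (!attrs.url) return null // A preview without a URL cannot be rendered
  return { url: attrs.url, title: attrs.title, description: attrs.description }
}

// Assumption: an author-supplied title is enough to render without fetching.
function needsFetch(attrs: LinkPreviewAttrs): boolean {
  return !attrs.title
}
```

With this split, a block that carries a title renders straight from the markup, and only bare-URL blocks call the scraper.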
Client-Side Hydration
The build process generates static HTML for fast page loads, but interactive features (galleries, link previews, math) need client-side JavaScript to become functional. The key is hydrating efficiently—running transformations once and avoiding duplicate processing.
Build time (server):
- Parse markdown and validate all syntax
- Generate heading IDs for anchor links
- Output static HTML with placeholder markers
Render time (client):
- Hydrate gallery markers → interactive image grids
- Hydrate link preview markers → rich cards (with API fetching)
- Typeset all math expressions via MathJax
- Initialize floating TOC with active state tracking
Critical: Single-pass DOM walk to avoid double hydration:
const processContent = useCallback((container: HTMLElement) => {
  processExplicitGalleryMarkers(container)
  processExplicitLinkPreviews(container)
  // Auto-detect multi-image paragraphs
  container.querySelectorAll('p').forEach(p => {
    const images = p.querySelectorAll('img')
    if (images.length >= 2) {
      p.classList.add('image-gallery', `image-gallery-${images.length}`)
      // Add handlers...
    }
  })
}, [])
useEffect(() => {
  const container = contentRef.current
  if (!container) return // Ref is not attached yet
  container.innerHTML = html
  processContent(container) // Single call
}, [html, processContent])
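If a second pass ever becomes unavoidable (for example, an effect that re-runs without resetting the HTML), a per-element flag is one way to keep hydration idempotent. The sketch below is an assumption, not part of the pipeline above: `hydrateOnce` and the `data-hydrated-*` flag are hypothetical, and the element type is reduced to a `dataset` so the snippet runs without a DOM.

```typescript
// Minimal structural type; a real HTMLElement satisfies it through its dataset.
interface Markable { dataset: Record<string, string | undefined> }

// Runs `hydrate` at most once per element and feature; returns true if it ran.
function hydrateOnce<T extends Markable>(el: T, feature: string, hydrate: (el: T) => void): boolean {
  const flag = `hydrated${feature}` // e.g. "hydratedGallery" → data-hydrated-gallery
  if (el.dataset[flag] === 'true') return false
  hydrate(el)
  el.dataset[flag] = 'true'
  return true
}

// Usage with a plain object standing in for a gallery element.
const fakeGallery: Markable = { dataset: {} }
let hydrationCount = 0
const firstRun = hydrateOnce(fakeGallery, 'Gallery', () => { hydrationCount += 1 })
const secondRun = hydrateOnce(fakeGallery, 'Gallery', () => { hydrationCount += 1 })
```

The second call is a no-op, so click handlers are attached only once even if the processing function runs twice.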
Validation and Error Handling
Rather than failing the build on syntax errors, we render visual error messages in the output. This lets authors see exactly what's wrong and where, while still allowing the site to build and deploy.
Our error-handling strategy:
- Visual callouts: Malformed gallery/link-preview blocks render as red error cards with details
- Dev warnings: Console warnings for missing or unused citation keys (development only)
- Regression tests: Dedicated test pages (`test-malformed-blocks.md`, `test-inline-code-escape.md`) catch edge cases
Math Rendering
LaTeX math expressions introduce several challenges when mixed with markdown. Dollar signs, angle brackets, and underscores all have special meanings in markdown, and incorrect escaping or wrapping can break rendering entirely. Our approach: client-side MathJax with careful preprocessing to handle these edge cases.
Challenges (and why they matter):
- Delimiter conflicts: `$` as currency vs. math delimiter, `_` triggers emphasis
- HTML escaping: `x < y` breaks if `<` becomes `&lt;` too early
- Paragraph wrapping: Display math shouldn't be wrapped in `<p>` tags
- Equation numbering: Need to reset counters per page
- Hydration timing: MathJax script might load after content renders
Solution:
Protect with placeholders (shown above), restore with escaping:
const escaped = block.replace(/</g, '&lt;').replace(/>/g, '&gt;')
// Replacer function: a replacement string would collapse "$$" to "$"
html = html.replace(placeholder, () =>
  /^\s*\$\$/.test(block) ? `<div class="math-display">${escaped}</div>` : escaped
)
Unwrap misplaced paragraphs:
html = html.replace(/<p>\s*<div class="math-display">([\s\S]*?)<\/div>\s*<\/p>/g,
(_, inner) => `<div class="math-display">${inner}</div>`)
MathJax config:
window.MathJax = {
  tex: {
    inlineMath: [['$', '$']],
    displayMath: [['$$', '$$']],
    packages: { '[+]': ['ams', 'noerrors', 'noundefined'] },
    macros: { argmin: "\\mathop{\\mathrm{argmin}}", argmax: "\\mathop{\\mathrm{argmax}}" },
    tags: 'ams' // AMS-style equation numbering
  }
}
Client-side rendering:
useEffect(() => {
  const el = contentRef.current
  if (!el) return
  window.MathJax?.texReset?.() // Reset equation numbers per page
  el.innerHTML = html
  // Returns true once MathJax is available and typesetting has started
  const tryTypeset = () => {
    if (!window.MathJax?.typesetPromise) return false
    window.MathJax.typesetPromise([el]).catch(console.warn)
    return true
  }
  if (tryTypeset()) return
  // Poll every 150ms until the MathJax script has loaded
  const interval = setInterval(() => {
    if (tryTypeset()) clearInterval(interval)
  }, 150)
  return () => clearInterval(interval)
}, [html])
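For the delimiter conflict listed under Challenges, one option is to borrow the rule Pandoc uses for dollar-delimited math: the opening `$` must be followed by a non-space character, and the closing `$` must be preceded by a non-space character and not followed by a digit. The sketch below applies that rule; it is an assumption about how currency could be told apart from math, not the pattern this site uses, and `findInlineMath` is a hypothetical name.

```typescript
// Inline math under the Pandoc-style rule; display math is assumed to be extracted already.
const INLINE_MATH = /\$(?=[^\s$])([^$\n]*?[^\s$])\$(?!\d)/g

// Returns every inline math span found in the text, delimiters included.
function findInlineMath(src: string): string[] {
  return src.match(INLINE_MATH) ?? []
}

const currencyOnly = findInlineMath('It costs $5 and $10 today')
const mathOnly = findInlineMath('where $x_i$ is the input')
const mixed = findInlineMath('pay $5 for $n$ items')
```

Prices such as "$5 and $10" are left alone because the text before the second `$` ends in a space, while `$x_i$` and `$n$` are still detected.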
Citation Keys
For technical and academic posts, we needed a lightweight citation system that works in plain markdown. The solution: use citation keys like [bib-key] in the body text, then automatically convert them to numbered references with anchor links and hover previews.
Input:
Recent work [bib-ho20; bib-song21] shows promise.
## References
1. [bib-ho20] Jonathan Ho, et al. "Denoising Diffusion..."
2. [bib-song21] Yang Song, et al. "Score-Based Generative..."
Output (each key becomes a numbered link to its reference):
Recent work [1][2] shows promise.
Implementation (pre-markdown):
const buildCitationMapAndTransform = (src: string) => {
  // Parse References section → build key-to-number map
  // (this elided step also yields bodyText and refText, used below)
  const citations = new Map<string, number>()
  // Transform citations in body
  const citationRegex = /\[(?:@?bib-[^\]\s,;]+(?:\s*[;,]\s*@?bib-[^\]\s,;]+)*)\]/g
  return bodyText.replace(citationRegex, (match) => {
    const keys = match.slice(1, -1).split(/[;,]/).map(s => s.trim())
    // Drop keys that have no entry in the References section
    const nums = keys.map(k => citations.get(k)).filter(Boolean)
    return nums.map(n =>
      `<a class="ref" href="#ref-${n}" data-ref-title="${escapeHtml(refText[n])}">[${n}]</a>`
    ).join('')
  })
}
Dev validation: Console warnings for missing keys or unused references. Front-matter citationKeysMode: "strip" removes citations for drafts.
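The map-building step and the dev validation can be sketched as follows. This assumes reference entries of the form `1. [bib-key] text`, as in the input example above; `buildCitationMap`, `findCitedKeys`, and `checkCitations` are hypothetical helper names.

```typescript
// Builds key → number and number → text maps from a References section.
function buildCitationMap(referencesSection: string) {
  const citations = new Map<string, number>()
  const refText: Record<number, string> = {}
  const entry = /^(\d+)\.\s+\[(bib-[^\]\s]+)\]\s*(.*)$/gm
  let m: RegExpExecArray | null
  while ((m = entry.exec(referencesSection)) !== null) {
    const n = Number(m[1])
    citations.set(m[2], n)
    refText[n] = m[3].trim()
  }
  return { citations, refText }
}

// Collects every key cited in the body, e.g. "[bib-a; bib-b]" → ["bib-a", "bib-b"].
function findCitedKeys(body: string): string[] {
  const group = /\[(@?bib-[^\]\s,;]+(?:\s*[;,]\s*@?bib-[^\]\s,;]+)*)\]/g
  const keys: string[] = []
  let m: RegExpExecArray | null
  while ((m = group.exec(body)) !== null) {
    m[1].split(/[;,]/).forEach(k => keys.push(k.trim().replace(/^@/, '')))
  }
  return keys
}

// Reports keys cited without a reference, and references never cited.
function checkCitations(body: string, citations: Map<string, number>) {
  const cited = findCitedKeys(body)
  const missing = cited.filter(k => !citations.has(k))
  const unused: string[] = []
  citations.forEach((_n, key) => { if (cited.indexOf(key) === -1) unused.push(key) })
  return { missing, unused }
}
```

In development, `missing` and `unused` are the two lists that would feed the console warnings.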
Inline Code Escaping Bug
An unexpected edge case emerged when documenting the pipeline itself: inline code containing HTML comments (like `<!-- link-preview -->`) would render as empty `<code></code>` tags because browsers interpret the comment syntax and hide the content.
The problem: When we write `<!-- example-link-card -->` in markdown, the backticks should protect it, but after HTML generation, the browser sees <code><!-- ... --></code> and treats the comment as invisible.
Fix: Protect before markdown processing:
protectedContent = protectedContent.replace(/`([^`]*<!--[^`]*-->[^`]*)`/gi, (match) => {
  protectedBlocks.push(match)
  return `EXAMPLE_INLINE_PH_${protectedBlocks.length - 1}_END`
})
Restore with escaping:
block.replace(/`([^`]+)`/g, (_, content) =>
  `<code>${content.replace(/</g, '&lt;').replace(/>/g, '&gt;')}</code>`
)
Table of Contents
The floating TOC is conditionally rendered to avoid clutter on short posts:
Visibility criteria:
- Page has ≥3 headings (substantial structure)
- Content is ≥3500 characters (long enough to benefit from navigation)
- Viewport is ≥1280px (desktop only; mobile uses native scrolling)
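The three criteria combine into a single predicate. The sketch below uses the thresholds listed above; the `TocInput` shape and the name `shouldShowToc` are hypothetical.

```typescript
// Thresholds taken from the visibility criteria above.
const TOC_MIN_HEADINGS = 3
const TOC_MIN_CHARS = 3500
const TOC_MIN_VIEWPORT_PX = 1280

interface TocInput {
  headingCount: number
  contentLength: number // characters of post content
  viewportWidth: number // CSS pixels
}

// All three conditions must hold for the floating TOC to render.
function shouldShowToc(input: TocInput): boolean {
  return (
    input.headingCount >= TOC_MIN_HEADINGS &&
    input.contentLength >= TOC_MIN_CHARS &&
    input.viewportWidth >= TOC_MIN_VIEWPORT_PX
  )
}
```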
Implementation:
- Custom remark plugin generates URL-friendly heading IDs at build time
- Tocbot library creates the floating TOC, auto-updating active states on scroll
- Smooth scroll to sections on click
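The slug logic inside a heading-ID plugin might look like the sketch below. It is only an assumption about what "URL-friendly" means here: `createSlugger` is a hypothetical helper, and its ASCII-only rule would drop non-Latin heading text, so a real plugin may want a more permissive character class.

```typescript
// Returns a function that turns heading text into unique, URL-friendly IDs.
function createSlugger(): (headingText: string) => string {
  const seen = new Map<string, number>()
  return (headingText: string): string => {
    const base = headingText
      .toLowerCase()
      .trim()
      .replace(/[^a-z0-9\s-]/g, '') // drop punctuation and symbols
      .replace(/\s+/g, '-')         // whitespace runs → single hyphen
      .replace(/-+/g, '-')          // collapse repeated hyphens
      .replace(/^-|-$/g, '') || 'section'
    // Repeated headings get a numeric suffix so anchors stay unique.
    const count = seen.get(base) ?? 0
    seen.set(base, count + 1)
    return count === 0 ? base : `${base}-${count}`
  }
}

const slugFor = createSlugger()
const slugA = slugFor('Why Plain Markdown Over MDX')
const slugB = slugFor('Performance & Bundle Size')
const slugC = slugFor('Architecture')
const slugD = slugFor('Architecture')
```

Creating one slugger per document keeps IDs deterministic: the same markdown always yields the same anchors.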
Performance & Bundle Size
The pipeline is optimized for fast initial loads with progressive enhancement:
- Static HTML first: All markdown → HTML at build time (~50-100ms/post), no runtime parsing
- Minimal client JS: Only hydrate interactive features (galleries, MathJax, TOC)
- External dependencies: MathJax via CDN, Tocbot dynamically imported (not in main bundle)
- Image optimization: Next.js built-in optimization for all images
- Hydration cost: <500ms for long posts with math + galleries
- Progressive enhancement: Content readable immediately; interactivity loads progressively
TODO: Link preview caching (Vercel KV, 7-day TTL to reduce Open Graph fetching)
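The caching idea can be prototyped with an in-memory map before any storage backend is wired in. The sketch below is a stand-in, not Vercel KV code: `createTtlCache` is a hypothetical helper, and the injectable clock exists only to make expiry testable.

```typescript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000

// In-memory TTL cache; entries expire `ttlMs` after they are written.
function createTtlCache<V>(ttlMs: number, now: () => number = Date.now) {
  const entries = new Map<string, { value: V; expiresAt: number }>()
  return {
    get(key: string): V | undefined {
      const hit = entries.get(key)
      if (!hit) return undefined
      if (hit.expiresAt <= now()) {
        entries.delete(key) // Expired: evict and report a miss
        return undefined
      }
      return hit.value
    },
    set(key: string, value: V): void {
      entries.set(key, { value, expiresAt: now() + ttlMs })
    },
  }
}

// Usage with a fake clock: a cached preview disappears after seven days.
let fakeNow = 0
const previewCache = createTtlCache<string>(SEVEN_DAYS_MS, () => fakeNow)
previewCache.set('https://example.com', 'cached card')
const freshHit = previewCache.get('https://example.com')
fakeNow = SEVEN_DAYS_MS
const expiredHit = previewCache.get('https://example.com')
```

Keying by URL means repeated links across posts share one Open Graph fetch per TTL window.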
Migration from Ghost
This pipeline emerged from migrating a Ghost blog with hundreds of posts containing galleries and rich embeds. Rather than losing that structured content, we built custom processing to preserve it in plain markdown.
Migration script workflow:
- Fetch all posts via Ghost Content API
- Convert Mobiledoc (Ghost's JSON format) → plain markdown
- Download and optimize all referenced images
- Convert Ghost gallery/bookmark cards → `:::example-gallery` and `:::example-link-preview` markers
The full migration strategy is documented in web/content/docs/2025-05-29-rich-content-preservation-plan.md. The key win: no content loss, no manual reformatting.
Key Lessons Learned
- Placeholder pattern is essential: The core technique for protecting fragile content (math, galleries, inline code) through multiple transformation passes without corruption
- Author experience > implementation ease: Plain markdown is harder to build but dramatically better for content authors—worth the engineering investment
- Fail visibly, not silently: Red error callouts in the UI beat silent failures or build breaks; authors can fix issues quickly
- Test edge cases religiously: Dedicated test pages (`test-malformed-blocks.md`, `test-inline-code-escape.md`, `test-math.md`) catch regressions early
- Single-pass hydration matters: Running DOM transformations once prevents race conditions, duplicate content, and hydration mismatches
- Dev tooling pays dividends: Console warnings for missing citations, TypeScript types for all content utilities, and visual error feedback create a tight feedback loop
Future Extensions
The placeholder pattern is highly extensible—adding new content types follows the same workflow: extract with regex, validate, protect during processing, restore with proper escaping.
Planned additions:
- Link preview caching: Vercel KV store with 7-day TTL to avoid repeated Open Graph fetching
- Data visualizations: `:::chart` blocks connected to Supabase for health/habit data
- Collapsible callouts: Expandable sections for long footnotes or technical details
- Video embeds: Responsive wrappers for YouTube/Vimeo with lazy loading
- Code diff blocks: Side-by-side diffs for refactoring walkthroughs
- Inline footnotes: Hover-triggered annotations without jumping to references
Each new feature requires three pieces: extraction regex, validation function, and restoration logic. The existing infrastructure handles everything else.
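Those three pieces could be bundled behind one interface, as in the sketch below. Everything here is hypothetical: the `ContentExtension` shape, the helper names, the `embed-error` class, and the made-up `:::example-video` syntax, which is used only to exercise the interface.

```typescript
// Hypothetical interface: extraction regex, validation, and restoration per content type.
interface ContentExtension {
  name: string
  pattern: RegExp // extraction regex; needs the g flag
  validate: (block: string) => string | null // error message, or null if valid
  restore: (block: string) => string // final HTML for a valid block
}

// Swaps each matching block for a placeholder and stores its final HTML.
function extractBlocks(src: string, ext: ContentExtension): { text: string; blocks: string[] } {
  const blocks: string[] = []
  const text = src.replace(ext.pattern, (match: string) => {
    const error = ext.validate(match)
    blocks.push(error ? `<div class="embed-error">${ext.name}: ${error}</div>` : ext.restore(match))
    return `EXAMPLE_${ext.name.toUpperCase()}_PH_${blocks.length - 1}_END`
  })
  return { text, blocks }
}

// Puts the stored HTML back after markdown processing.
function restoreBlocks(html: string, ext: ContentExtension, blocks: string[]): string {
  return blocks.reduce(
    (out, block, i) => out.replace(`EXAMPLE_${ext.name.toUpperCase()}_PH_${i}_END`, () => block),
    html
  )
}

const escapeAttribute = (s: string): string => s.replace(/&/g, '&amp;').replace(/"/g, '&quot;')

// Made-up :::example-video syntax for illustration.
const videoExtension: ContentExtension = {
  name: 'video',
  pattern: /:::example-video\s+url="([^"]*)"\s*:::/g,
  validate: (block) => (/url="https:\/\/[^"]+"/.test(block) ? null : 'url must start with https://'),
  restore: (block) => {
    const url = /url="([^"]*)"/.exec(block)?.[1] ?? ''
    return `<div class="video-embed" data-url="${escapeAttribute(url)}"></div>`
  },
}
```

An invalid block takes the same path as a valid one, except that its stored HTML is an error callout, which keeps the fail-visibly behavior without any special casing.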
Conclusion
Building a custom markdown pipeline requires more upfront work than adopting MDX, but the payoff is substantial: writers get a simple, portable authoring experience while the site maintains rich interactive features.
What this approach provides:
- Simple authoring: Pure markdown that any writer can use
- Reliable builds: Deterministic output, no environment-dependent rendering
- Clean diffs: Code reviews focus on content, not component logic
- Extensibility: New features plug into the existing placeholder system
Critical architectural decisions:
- Placeholder protection: Core technique for preserving fragile content through transformations
- Build-time validation: Catch errors early with visual feedback
- Client-side hydration: Fast initial loads, progressive interactivity
- Single-pass processing: Avoid race conditions and duplicate content
- Comprehensive testing: Test pages prevent regressions
Our recommendation: For content-heavy sites (blogs, documentation, marketing), plain markdown + custom processing beats MDX. You get better authoring UX, cleaner content portability, and deterministic builds. Only reach for MDX if you need deeply interactive components (calculators, dashboards, complex forms) embedded directly in content.
Test Suites
For extensive testing and live examples of all pipeline features, see our comprehensive test pages:
- Test: Math Rendering - Complex LaTeX equations, machine learning formulas, Maxwell's equations, custom macros, and edge cases
- Test: Image Galleries - Auto-detected galleries (1-18+ images), explicit gallery markers, rich captions, and layout testing
- Test: Code Blocks and Syntax Highlighting - Syntax highlighting across 10+ programming languages with realistic examples
- Test: Malformed Gallery and Link-Preview Blocks - Error handling validation with visual error callouts for debugging
- Test: Inline Code Escaping - Validates that backticked HTML comments and tags render literally as code
- Test: Basic Markdown Features - Testing core markdown syntax including headers, lists, tables, links, and formatting
- Test: Iframe Embedding - Testing iframe embedding for videos, maps, and other external content
- Test: URL Preview Cards - Testing link preview card functionality for external URLs
- Test: Miscellaneous Features - Additional testing for various features and edge cases
These test suites serve as both regression testing and showcase pages for the pipeline's full capabilities.