Building a Rehype Plugin

I was recently fixing syntax highlighting on this blog and found I was missing some important tools. I write my posts in Markdown and process them with unified. The standard way to embed a code example in Markdown is a code fence: a code block surrounded by lines of three backticks:

Introducing some example code:

```
print "Hello world!"
```

I had been using highlight.js in my unified pipeline, or more specifically the rehype-highlight plugin that wraps highlight.js. Unfortunately, highlight.js still can't highlight JSX in React code examples, per this open issue:

Support for JSX/TSX is not a single thing. We already include "basic" support for embedded XML/HTML fragments using the xml sublanguage. We do not support comments or embedded JS. This has all been attempted several times and is a very, very hard problem. It may actually be impossible for us to solve for various reasons.

This makes sense given the design philosophy of highlight.js:

Philosophical: even if we have the tools we purposely do not build full grammar parsers (out of scope), we only do smart pattern matching.

It's a great library, but it trades full language support for staying lightweight. Since I run highlighting at build time, being lightweight wasn't a big consideration in my case, so I looked for an alternative.

Alternatives

The primary competitor to highlight.js is Prism. Mapbox has helpfully released a unified-compatible wrapper, rehype-prism, that's a drop-in replacement for rehype-highlight. I switched to rehype-prism in my Markdown processing pipeline, and with some basic SCSS it was possible to get very good JSX syntax highlighting:

/* Based on GitHub light theme */
code.language-js,
code.language-typescript,
code.language-jsx {
  .keyword, .builtin /* TypeScript-specific token */ {
    color: #d73a49;
  }
  .number,
  .boolean,
  .string,
  .string-property,
  .attr-value {
    color: #e36209;
  }
  .constant {
    color: #005cc5;
  }
  .comment {
    color: #6a737d;
  }
  .function,
  .function-variable {
    color: #6f42c1;
  }
  .class-name, .tag /* JSX-specific token */ {
    color: #22863a;
  }
  .literal-property, .attr-name /* JSX-specific token */ {
    color: #005cc5;
  }
  /* Prevents opening/closing JSX brackets from being highlighted */
  .punctuation {
    color: black;
  }
}

Because Prism reuses the same token class names across its grammars, these rules also carry over to many other languages once you add the matching code.language-* selector.

Highlighting Diffs

There's still a gap when you try to apply syntax highlighting to diff blocks. By this I mean code blocks that look like:

- const myVar = "";
+ const myVar = "Hello";
  console.log("My var:", myVar);

Notice that there's no highlighting beyond the added and removed lines? How do you get the highlighter to mark the changes while still highlighting the code in the unchanged areas? The issue is that the highlighter is applying the grammar for the diff "language". The only real tokens in a diff are the + and - line prefixes, and everything else is treated as plain text.
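To make that limitation concrete, here is a minimal sketch of what a diff-only view of the source amounts to. It is not Prism's actual grammar: groupDiffLines and the block names are made up for illustration, and the only assumption is that the first character of each line decides how the line is classified.

```javascript
// Simplified stand-in for a diff grammar: only the first character of each
// line is inspected. The block names here are illustrative, not Prism's API.
const PREFIX_KINDS = { "+": "inserted", "-": "deleted", " ": "unchanged" };

/** Groups consecutive lines that share a prefix into blocks. */
function groupDiffLines(source) {
  const blocks = [];
  for (const line of source.split("\n")) {
    // Empty lines and unknown prefixes fall into a catch-all kind
    const kind = PREFIX_KINDS[line[0]] ?? "other";
    const last = blocks[blocks.length - 1];
    if (last && last.kind === kind) {
      last.lines.push(line);
    } else {
      blocks.push({ kind, lines: [line] });
    }
  }
  return blocks;
}

const blocks = groupDiffLines(
  [
    '- const myVar = "";',
    '+ const myVar = "Hello";',
    '  console.log("My var:", myVar);',
  ].join("\n")
);

// Each block knows whether it was added, removed, or unchanged, but nothing
// about the JavaScript inside its lines.
console.log(blocks.map((block) => block.kind));
```

Everything after the prefix stays an opaque string, which is why the code inside each line comes out unhighlighted.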

This comment on a highlight.js GitHub issue gave me an idea on how to fix this:

I implemented something like this for codediff.js. You run both syntax highlighting and diff on the before & after, then combine the results. The tricky bit is that you need to make sure none of the syntax highlighting <span>s cross lines.
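The "tricky bit" in that comment, keeping highlighted spans from crossing line boundaries, could be sketched roughly like this. It is not taken from codediff.js: splitIntoLines is a hypothetical helper, and it assumes hast-style nodes where text nodes carry a value and elements carry children.

```javascript
/**
 * Splits a list of highlighted nodes into lines. A span containing a newline
 * is closed at the end of one line and reopened (with the same properties)
 * on the next, so no element ever crosses a line boundary.
 */
function splitIntoLines(nodes) {
  const lines = [[]];
  for (const node of nodes) {
    if (node.type === "text") {
      node.value.split("\n").forEach((part, i) => {
        if (i > 0) lines.push([]); // every "\n" starts a new line
        if (part) lines[lines.length - 1].push({ type: "text", value: part });
      });
    } else {
      // Split the element's children first, then re-wrap each piece
      splitIntoLines(node.children ?? []).forEach((children, i) => {
        if (i > 0) lines.push([]);
        if (children.length > 0) {
          lines[lines.length - 1].push({ ...node, children });
        }
      });
    }
  }
  return lines;
}

// A multi-line comment token is the classic case of a span crossing lines
const highlighted = [
  { type: "text", value: "a " },
  {
    type: "element",
    tagName: "span",
    properties: { className: ["token", "comment"] },
    children: [{ type: "text", value: "/* x\ny */" }],
  },
  { type: "text", value: "\nb" },
];

console.log(JSON.stringify(splitIntoLines(highlighted), null, 2));
```

Once every line is self-contained, each one can be wrapped in its own added, removed, or unchanged marker without breaking the markup.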

So here's how I implemented that idea in a rehype + Prism pipeline.

Writing a Rehype Plugin

Overall it wasn't too hard, starting from rehype-prism's implementation as an example. The core idea is to use visit() (from unist-util-visit) to traverse the syntax tree (hast) generated upstream by the Markdown processing pipeline, and to replace the contents of each visited <code> element with syntax-highlighted nodes. To get diff highlighting, the simplest compromise I could think of was to apply language highlighting to the unchanged portions of a diff and diff highlighting to the rest:

import { toText } from "hast-util-to-text";
import { refractor } from "refractor";
import { visit } from "unist-util-visit";

// register each language you want to highlight, including diff
import langDiff from "refractor/lang/diff.js";
import langJs from "refractor/lang/javascript.js";
import langJsx from "refractor/lang/jsx.js";
import langTs from "refractor/lang/typescript.js";

refractor.register(langDiff);
refractor.register(langJs);
refractor.register(langJsx);
refractor.register(langTs);

...

/**
 * Rehype plugin to highlight code blocks with Prism.js, based on rehype-prism.
 * Performs additional processing to perform highlighting within diff code blocks.
 */
export function rehypeDiffHighlight() {
  return (tree) => {
    visit(tree, "element", (node, index, parent) => {
      // Only rewrite <pre><code> elements
      if (!parent || parent.tagName !== "pre" || node.tagName !== "code") {
        return;
      }

      const lang = getLanguage(node);

      if (!lang) {
        return;
      }

      const codeElemValue = toText(node, { whitespace: "pre" });

      if (lang.startsWith("diff-")) {
        const diffSublanguage = lang.slice("diff-".length); // sublanguage after the "diff-" prefix, e.g. "jsx" for "diff-jsx"
        const highlightedCodeAst = refractor.highlight(codeElemValue, "diff");

        /* each (non-text) child is a block of the diff and contains 1+ lines */
        for (const diffBlockElem of highlightedCodeAst.children) {
          if (
            diffBlockElem.type !== "element" ||
            diffBlockElem.tagName !== "span"
          ) {
            continue;
          }

          // Replace unchanged blocks with highlighted code
          if (hastHasClassName(diffBlockElem, "unchanged")) {
            const diffBlockElemValue = toText(diffBlockElem, {
              whitespace: "pre",
            });

            const highlightedDiffBlock = refractor.highlight(
              diffBlockElemValue,
              diffSublanguage
            );

            diffBlockElem.children = highlightedDiffBlock.children;
          }
        }
        node.children = highlightedCodeAst.children;
      } else {
        // Highlight non-diff languages as normal
        const highlightedCodeAst = refractor.highlight(codeElemValue, lang);

        // Unlike refractor v3, refractor v4 returns a single root node rather than
        // an array of nodes, so attach its children (not the result itself) to our
        // <code> node. Compare to
        // https://github.com/mapbox/rehype-prism/blob/main/index.js
        node.children = highlightedCodeAst.children;
      }
    });
  };
}
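The plugin calls two helpers, getLanguage and hastHasClassName, that are elided above. The following is a guess at what they might look like, modelled on how rehype-prism reads the language from a language-* class name; it is not the original code. It also assumes the upstream Markdown pipeline turns a fence tagged diff-jsx into <code class="language-diff-jsx">. The visitSketch function is only a simplified stand-in for unist-util-visit so the example runs without dependencies.

```javascript
// Hypothetical helpers: the post elides the real ones, so these are assumptions.

/** Returns "jsx" for <code class="language-jsx">, or null if no language class is set. */
function getLanguage(node) {
  const classNames = node.properties?.className ?? [];
  for (const className of classNames) {
    if (typeof className === "string" && className.startsWith("language-")) {
      return className.slice("language-".length).toLowerCase();
    }
  }
  return null;
}

/** Checks whether a hast element carries the given class name. */
function hastHasClassName(node, className) {
  const classNames = node.properties?.className;
  return Array.isArray(classNames) && classNames.includes(className);
}

// Simplified stand-in for unist-util-visit (not the real implementation):
// calls visitor(node, index, parent) for every node of the given type.
function visitSketch(node, type, visitor, index, parent) {
  if (node.type === type) visitor(node, index, parent);
  (node.children ?? []).forEach((child, i) => {
    visitSketch(child, type, visitor, i, node);
  });
}

// Hand-written hast fragment: one fenced block and one inline <code>
const tree = {
  type: "root",
  children: [
    {
      type: "element",
      tagName: "pre",
      properties: {},
      children: [
        {
          type: "element",
          tagName: "code",
          properties: { className: ["language-diff-jsx"] },
          children: [{ type: "text", value: "+ <App />\n" }],
        },
      ],
    },
    {
      type: "element",
      tagName: "p",
      properties: {},
      children: [
        {
          type: "element",
          tagName: "code",
          properties: {},
          children: [{ type: "text", value: "inline" }],
        },
      ],
    },
  ],
};

const found = [];
visitSketch(tree, "element", (node, index, parent) => {
  // Same guard as the plugin: only <pre><code> pairs are rewritten
  if (parent?.tagName !== "pre" || node.tagName !== "code") return;
  found.push(getLanguage(node));
});
console.log(found); // languages of the fenced blocks only; inline code is skipped
```

Returning null for a missing language class is what lets the plugin leave untagged code blocks alone.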

Gotchas

Use hast-util-to-text instead of hast-util-to-string to recursively flatten a block into a newline-separated text string.

I also needed to match the upstream hast typings between hast-util-to-text and refractor. hast-util-to-text v4 points to @types/hast v3, but refractor v4.8 points to @types/hast v2. These are the versions I needed to set to satisfy the TypeScript compiler (note that hast-util-to-text is pinned to a v3 release):

  ...
  "hast-util-to-text": "3.1.2",
  "refractor": "^4.8.1",
  ...