Escape HTML Inside <code> or <pre> Tag to Entities to Display Raw Code with PrismJS

Sep 14, 2016

Introduction

Have you ever wondered how websites present HTML code without the browser treating it as part of the DOM? This tutorial covers how to do so correctly. The code blocks on this very page are a good example.

Security will not be covered, since it is assumed that you are the one entering the data, not a user. If you need to protect against Cross-Site Scripting (XSS), then try out HTML Purifier.
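
Before looking at the individual approaches, here is a minimal sketch of the underlying idea: turning markup characters into entities so the browser prints them instead of parsing them. The sample string and variable names are only illustrative.

```php
<?php
// Illustrative input: markup that should be shown to the reader, not rendered.
$raw = '<VirtualHost *:80>';

// Convert <, >, &, and both quote styles into HTML entities.
$escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// Prints: <pre><code>&lt;VirtualHost *:80&gt;</code></pre>
echo '<pre><code>' . $escaped . '</code></pre>', "\n";
```

Everything that follows is about applying this conversion only to the contents of code blocks while leaving the rest of the page's HTML alone.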

The Regex Way

You're not Chuck Norris, so don't even try to parse your HTML with a regex. Regular expressions seem like a good idea in theory, but the reality is that HTML is simply too unpredictable for regexes to accurately work 100% of the time.

I will provide the code, but please proceed with caution. I would never recommend using this in production. Read this hilarious explanation to understand why implementing this method is a terrible idea.

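function escapeTxtInCodeTag($txt) {
 $callback = function($matches) {
   // $matches[1] holds the attributes of the <code> tag, $matches[2] its contents
   return '<code' . $matches[1] . '>' . htmlentities($matches[2], ENT_QUOTES, 'UTF-8') . '</code>';
 };
 // The "s" flag lets the contents span several lines; the "i" flag also matches <CODE>
 $txt = preg_replace_callback('#<\s*code\b([^>]*)>(.+?)<\s*/\s*code\s*>#is', $callback, $txt);
 return $txt;
}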
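
To see where the regex approach falls over, here is a small self-contained sketch. The function name `escapeCodeWithRegex` is hypothetical, chosen only to avoid clashing with the function above, and the pattern is a variant with the `s` and `i` flags so that multi-line and uppercase tags match.

```php
<?php
// Hypothetical, renamed copy of the regex approach (pattern uses the "s" and "i" flags).
function escapeCodeWithRegex(string $txt): string {
    return preg_replace_callback(
        '#<\s*code\b([^>]*)>(.+?)<\s*/\s*code\s*>#is',
        function (array $m): string {
            // $m[1] = tag attributes, $m[2] = contents to escape
            return '<code' . $m[1] . '>' . htmlentities($m[2], ENT_QUOTES, 'UTF-8') . '</code>';
        },
        $txt
    );
}

// Works: prints <code class="language-html">&lt;p&gt;Hi&lt;/p&gt;</code>
echo escapeCodeWithRegex('<code class="language-html"><p>Hi</p></code>'), "\n";

// Breaks: the sample itself contains "</code>", so the match ends early
// and the string comes back unchanged, with the inner tag left unescaped.
echo escapeCodeWithRegex('<code>Close the tag with </code> when done.</code>'), "\n";
```

The second call is the kind of input a regex cannot tell apart from a real closing tag, which is why this method is best kept out of production.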

DOMDocument to the Rescue?

The PHP DOMDocument class is an excellent way to parse HTML, as it lets you traverse or modify the markup as a tree of actual HTML elements. This is the perfect solution for most cases, since you have full parsing control. However, I ran into an issue when I used Apache configuration code in another tutorial. My example contained <VirtualHost *:80>, but when I ran the code, the tag came out as <virtualhost>.

Why did it do this, you might ask? Well, DOMDocument is meant to be an HTML/XML parser. Because of that, it thinks it's helping you out by converting invalid HTML to valid HTML. However, this makes it impossible to show Apache configuration examples exactly as written. This is probably not an issue for most people anyway.

function DOMinnerHTML($element) {
 // Serialize every child node back to HTML, then escape the combined string
 $innerHTML = '';
 foreach ($element->childNodes as $child) {
   $innerHTML .= $element->ownerDocument->saveHTML($child);
 }
 return htmlspecialchars($innerHTML, ENT_QUOTES, 'UTF-8');
}
function escapeTxtInCodeTag($txt) {
 if (empty($txt)) return false; // nothing to parse, so quit early to prevent errors
 $dom = new DOMDocument('1.0', 'utf-8');
 libxml_use_internal_errors(true); // collect parser warnings instead of printing them
 // Note: the 'HTML-ENTITIES' target encoding is deprecated as of PHP 8.2
 $dom->loadHTML(mb_convert_encoding($txt, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
 // Copy the live node list into an array so that replacing nodes cannot disturb the loop
 $domCodeTags = iterator_to_array($dom->getElementsByTagName('code'));
 foreach ($domCodeTags as $vals) {
   $codeClassName = $vals->getAttribute('class');
   $newCodeWrap = $dom->createElement('code');
   $newCodeWrap->setAttribute('class', $codeClassName);
   $newCodeWrap->nodeValue = DOMinnerHTML($vals);
   $vals->parentNode->replaceChild($newCodeWrap, $vals);
 }
 // Return the result (rather than echoing it) so it matches the regex version
 return $dom->saveHTML();
}
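
The tag-name normalization described above can be reproduced in isolation. This is a minimal sketch, assuming the standard libxml-based DOM extension; `<MyTag>` is a made-up element used only to show the case folding.

```php
<?php
// Collect parser warnings internally; the unknown tag would otherwise trigger one.
libxml_use_internal_errors(true);

$dom = new DOMDocument('1.0', 'utf-8');
// <MyTag> is a made-up element, standing in for something like <VirtualHost>.
$dom->loadHTML('<code><MyTag>x</MyTag></code>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$html = $dom->saveHTML();
libxml_clear_errors();

// The tag name comes back lowercased as <mytag>.
echo $html;
```

Because the parser has already folded the name to lowercase by the time the tree exists, no amount of escaping afterwards can bring the original casing back.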

Markdown

This is easily the best approach, in my opinion. Markdown, as it is cleverly named, is a lightweight markup language that works like an extra-lite version of HTML (and no, I'm not referring to the atrocious BBCode, bleh).

It's awesome because it automatically converts any <, >, and & inside a code block to HTML entities. An important note is that the Markdown itself should be stored in the database and only converted to HTML after it is retrieved. This allows you to keep editing your posts in Markdown.

Parsedown is undoubtedly an outstanding Markdown parser, but what if you want to add syntax highlighting, like PrismJS?

Enter PHP Markdown Extra. Markdown Extra, as its name implies, adds extra stuff to the existing Markdown syntax.

One of the coolest parts is the fact that you can actually add attributes, like class names. For example, if I want to show some CSS code, I would do this:

``` {.language-css .line-numbers}
/* CSS code */
```

Markdown Extra allows you to apply classes to either <pre> or <code>, but not both at the same time for some reason. This might seem like it would cause a predicament. Nevertheless, PrismJS's parser is smart enough to put the classes in the correct spot. It permits inheritance, which means that if you apply .language-x to <pre>, then it will still also add it to <code>.

The extensions all seem to be meant for code blocks, so it would be wise to apply all of the classes to <pre> instead of <code>. You can achieve this by setting code_attr_on_pre to true, as on line 6 of the following example.

If you do not need line numbers or any other extensions, then you can shorten the syntax even more by setting code_class_prefix, as on line 5. Let's take a look at how we would convert Markdown to HTML.

use \Michelf\MarkdownExtra;
require_once 'Michelf/MarkdownExtra.inc.php';
function markDownToUp($markdown) {
 $parser = new MarkdownExtra;
 $parser->code_class_prefix = 'language-'; //allows .css class name
 $parser->code_attr_on_pre = true; //all attributes in code block apply to pre
 return $parser->transform($markdown);
}

Now you can do this:

``` .css
/* CSS code */
```

Unfortunately, you can't assign attributes to inline code in Markdown Extra. At first I was extremely irritated that there was no way for me to highlight inline code. However, now that I've had time to adjust, I actually think that it's better to not have syntax highlighting on inline code anyway, since it looks a lot cleaner without it. Just my two cents.