21.2 HTML

HTML (HyperText Markup Language) underlies the majority of the web. It’s a special case of SGML (Standard Generalised Markup Language), and it’s similar but not identical to XML (eXtensible Markup Language). HTML looks like this:

<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100' />
</body>

Even if you’ve never looked at HTML before, you can still see that the key component of its coding structure is tags, which look like <tag></tag> or <tag />. Tags can be nested within other tags and intermingled with text. There are over 100 HTML tags, but in this chapter we’ll focus on just a handful:

<body> is the top-level tag that contains all content.
<h1> defines a top-level heading.
<p> defines a paragraph.
<b> emboldens text.
<img> embeds an image.

Tags can have named attributes which look like <tag name1='value1' name2='value2'></tag>. Two of the most important attributes are id and class, which are used in conjunction with CSS (Cascading Style Sheets) to control the visual appearance of the page.

Void tags, like <img>, don’t have any children, and are written <img />, not <img></img>. Since they have no content, attributes are more important, and img has three that are used with almost every image: src (where the image lives), width, and height.

Because < and > have special meanings in HTML, you can’t write them directly. Instead you have to use the HTML escapes: &gt; and &lt;. And since those escapes use &, if you want a literal ampersand you have to escape it as &amp;.
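Because the escapes themselves contain &, the order in which characters are replaced matters. The following sketch illustrates this with two hypothetical helpers (escape_amp_first() and escape_amp_last() are illustrative names, not part of this chapter's code) that differ only in when the ampersand is handled:

```r
# Hypothetical helpers for illustration only.
# Replace & first, then the angle brackets.
escape_amp_first <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

# Replace the angle brackets first, then &.
escape_amp_last <- function(x) {
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub("&", "&amp;", x, fixed = TRUE)
}

escape_amp_first("a < b & c")
#> [1] "a &lt; b &amp; c"
escape_amp_last("a < b & c")
#> [1] "a &amp;lt; b &amp; c"
```

In the second version the & introduced by &lt; is itself escaped, so a browser would display the literal text &lt; instead of <.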

21.2.1 Goal

Our goal is to make it easy to generate HTML from R. To give a concrete example, we want to generate the following HTML:

<body>
  <h1 id='first'>A heading</h1>
  <p>Some text &amp; <b>some bold text.</b></p>
  <img src='myimg.png' width='100' height='100' />
</body>

Using the following code that matches the structure of the HTML as closely as possible:

with_html(
  body(
    h1("A heading", id = "first"),
    p("Some text &", b("some bold text.")),
    img(src = "myimg.png", width = 100, height = 100)
  )
)

This DSL has the following three properties:

The nesting of function calls matches the nesting of tags.
Unnamed arguments become the content of the tag, and named arguments become their attributes.
& and other special characters are automatically escaped.

21.2.2 Escaping

Escaping is so fundamental to translation that it’ll be our first topic. There are two related challenges:

In user input, we need to automatically escape &, < and >.
At the same time we need to make sure that the &, < and > we generate are not double-escaped (i.e. that we don’t accidentally generate &amp;amp;, &amp;lt; and &amp;gt;).

The easiest way to do this is to create an S3 class (Section 13.3) that distinguishes between regular text (that needs escaping) and HTML (that doesn’t).

html <- function(x) structure(x, class = "advr_html")

print.advr_html <- function(x, ...) {
  out <- paste0("<HTML> ", x)
  cat(paste(strwrap(out), collapse = "\n"), "\n", sep = "")
}

We then write an escape generic. It has two important methods:

escape.character() takes a regular character vector and returns an HTML vector with special characters (&, <, >) escaped.
escape.advr_html() leaves already escaped HTML alone.

escape <- function(x) UseMethod("escape")

escape.character <- function(x) {
  x <- gsub("&", "&amp;", x)
  x <- gsub("<", "&lt;", x)
  x <- gsub(">", "&gt;", x)

  html(x)
}

escape.advr_html <- function(x) x

Now we check that it works:

escape("This is some text.")
#> <HTML> This is some text.
escape("x > 1 & y < 2")
#> <HTML> x &gt; 1 &amp; y &lt; 2

# Double escaping is not a problem
escape(escape("This is some text. 1 > 2"))
#> <HTML> This is some text. 1 &gt; 2

# And text we know is HTML doesn't get escaped.
escape(html("<hr />"))
#> <HTML> <hr />

Conveniently, this also allows a user to opt out of our escaping if they know the content is already escaped.
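As a small illustration of opting out, the sketch below repeats the definitions from this section so that it runs on its own, then passes the same markup once as plain text and once wrapped in html(). unclass() is used only to show the underlying string without relying on the print method.

```r
# Definitions repeated from this section so the example is self-contained.
html <- function(x) structure(x, class = "advr_html")
escape <- function(x) UseMethod("escape")
escape.character <- function(x) {
  x <- gsub("&", "&amp;", x)
  x <- gsub("<", "&lt;", x)
  x <- gsub(">", "&gt;", x)
  html(x)
}
escape.advr_html <- function(x) x

# Plain text: the markup is escaped, so it would be displayed literally.
unclass(escape("<em>Note</em>"))
#> [1] "&lt;em&gt;Note&lt;/em&gt;"

# Marked as HTML by the user: passed through untouched.
unclass(escape(html("<em>Note</em>")))
#> [1] "<em>Note</em>"
```

Opting out shifts responsibility to the caller: anything wrapped in html() is trusted as-is, so it should only be used on content that is known to be safe.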

21.2.3 Basic tag functions

Next, we’ll write a one-tag function by hand, then figure out how to generalise it so we can generate a function for every tag with code.

Let’s start with <p>. HTML tags can have both attributes (e.g., id or class) and children (like <b> or <i>). We need some way of separating these in the function call. Given that attributes are named and children are not, it seems natural to use named and unnamed arguments for them respectively. For example, a call to p() might look like:

p("Some text. ", b(i("some bold italic text")), class = "mypara")

We could list all the possible attributes of the <p> tag in the function definition, but that’s hard because there are many attributes, and because it’s possible to use custom attributes. Instead, we’ll use ... and separate the components based on whether or not they are named. With this in mind, we create a helper function that wraps around rlang::list2() (Section 19.6) and returns named and unnamed components separately:

dots_partition <- function(...) {
  dots <- list2(...)

  # names() is NULL when no argument is named at all
  if (is.null(names(dots))) {
    is_named <- rep(FALSE, length(dots))
  } else {
    is_named <- names(dots) != ""
  }

  list(
    named = dots[is_named],
    unnamed = dots[!is_named]
  )
}

str(dots_partition(a = 1, 2, b = 3, 4))
#> List of 2
#>  $ named  :List of 2
#>   ..$ a: num 1
#>   ..$ b: num 3
#>  $ unnamed:List of 2
#>   ..$ : num 2
#>   ..$ : num 4
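The is.null() branch in dots_partition() may look redundant, so here is a short sketch of why it is there. It uses base list() rather than list2(), on the assumption that the two name their elements the same way in these simple cases:

```r
# With no named elements at all, names() is NULL rather than c("", "").
names(list(1, 2))
#> NULL

# With at least one named element, unnamed ones get an empty string.
names(list(a = 1, 2))
#> [1] "a" ""

# Comparing NULL with "" yields a zero-length logical vector ...
is_named <- names(list(1, 2)) != ""
is_named
#> logical(0)

# ... and subsetting with it silently drops every element.
length(list(1, 2)[!is_named])
#> [1] 0
```

Without the explicit check, a call such as p("Some text") with only unnamed arguments would lose its children.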

We can now create our p() function. Notice that there’s one new function here: html_attributes(). It takes a named list and returns the HTML attribute specification as a string. It’s a little complicated (in part, because it deals with some idiosyncrasies of HTML that I haven’t mentioned here), but it’s not that important and doesn’t introduce any new programming ideas, so I won’t discuss it in detail. You can find the source online if you want to work through it yourself.
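If you don't have dsl-html-attributes.r to hand, the following gives a rough idea of what such a helper has to do. It is a simplified stand-in written for illustration, not the implementation in that file: the name html_attributes_sketch() is hypothetical, it assumes every attribute value has length one, and it ignores the HTML idiosyncrasies mentioned above.

```r
# Hypothetical, simplified stand-in for html_attributes().
# Assumes `attrs` is a named list whose values all have length 1.
html_attributes_sketch <- function(attrs) {
  if (length(attrs) == 0) return("")

  values <- vapply(attrs, function(value) {
    value <- as.character(value)
    # Escape & first, then the other special characters.
    value <- gsub("&", "&amp;", value, fixed = TRUE)
    value <- gsub("<", "&lt;", value, fixed = TRUE)
    value <- gsub(">", "&gt;", value, fixed = TRUE)
    # Values are wrapped in single quotes, so escape those too.
    gsub("'", "&#39;", value, fixed = TRUE)
  }, character(1))

  # A leading space separates each attribute from whatever precedes it.
  paste0(" ", names(attrs), "='", values, "'", collapse = "")
}

html_attributes_sketch(list(src = "myimg.png", width = 100))
#> [1] " src='myimg.png' width='100'"
html_attributes_sketch(list())
#> [1] ""
```

Returning an empty string for an empty list means the result can always be pasted directly after the tag name.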

source("dsl-html-attributes.r")
p <- function(...) {
  dots <- dots_partition(...)
  attribs <- html_attributes(dots$named)
  children <- map_chr(dots$unnamed, escape)

  html(paste0(
    "<p", attribs, ">",
    paste(children, collapse = ""),
    "</p>"
  ))
}

p("Some text")
#> <HTML> <p>Some text</p>
p("Some text", id = "myid")
#> <HTML> <p id='myid'>Some text</p>
p("Some text", class = "important", `data-value` = 10)
#> <HTML> <p class='important' data-value='10'>Some text</p>

21.2.4 Tag functions

It’s straightforward to adapt p() to other tags: we just need to replace "p" with the name of the tag. One elegant way to do that is to create a function with rlang::new_function() (Section 19.7.4), using unquoting and paste0() to generate the starting and ending tags.

tag <- function(tag) {
  new_function(
    exprs(... = ),
    expr({
      dots <- dots_partition(...)
      attribs <- html_attributes(dots$named)
      children <- map_chr(dots$unnamed, escape)

      html(paste0(
        !!paste0("<", tag), attribs, ">",
        paste(children, collapse = ""),
        !!paste0("</", tag, ">")
      ))
    }),
    caller_env()
  )
}
tag("b")
#> function (...) 
#> {
#>     dots <- dots_partition(...)
#>     attribs <- html_attributes(dots$named)
#>     children <- map_chr(dots$unnamed, escape)
#>     html(paste0("<b", attribs, ">", paste(children, collapse = ""), 
#>         "</b>"))
#> }

We need the weird exprs(... = ) syntax to generate the empty ... argument in the tag function. See Section 18.6.2 for more details.

Now we can run our earlier example:

p <- tag("p")
b <- tag("b")
i <- tag("i")
p("Some text. ", b(i("some bold italic text")), class = "mypara")
#> <HTML> <p class='mypara'>Some text. <b><i>some bold italic
#> text</i></b></p>

Before we generate functions for every possible HTML tag, we need to create a variant that handles void tags. void_tag() is quite similar to tag(), but it throws an error if there are any unnamed arguments (since void tags can't have children), and the tag itself looks a little different.

void_tag <- function(tag) {
  new_function(
    exprs(... = ),
    expr({
      dots <- dots_partition(...)
      if (length(dots$unnamed) > 0) {
        abort(!!paste0("<", tag, "> must not have unnamed arguments"))
      }
      attribs <- html_attributes(dots$named)

      html(paste0(!!paste0("<", tag), attribs, " />"))
    }),
    caller_env()
  )
}

img <- void_tag("img")
img
#> function (...) 
#> {
#>     dots <- dots_partition(...)
#>     if (length(dots$unnamed) > 0) {
#>         abort("<img> must not have unnamed arguments")
#>     }
#>     attribs <- html_attributes(dots$named)
#>     html(paste0("<img", attribs, " />"))
#> }
img(src = "myimage.png", width = 100, height = 100)
#> <HTML> <img src='myimage.png' width='100' height='100' />

21.2.5 Processing all tags

Next we need to generate these functions for every tag. We’ll start with a list of all HTML tags:

tags <- c("a", "abbr", "address", "article", "aside", "audio",
  "b","bdi", "bdo", "blockquote", "body", "button", "canvas",
  "caption","cite", "code", "colgroup", "data", "datalist",
  "dd", "del","details", "dfn", "div", "dl", "dt", "em",
  "eventsource","fieldset", "figcaption", "figure", "footer",
  "form", "h1", "h2", "h3", "h4", "h5", "h6", "head", "header",
  "hgroup", "html", "i","iframe", "ins", "kbd", "label",
  "legend", "li", "mark", "map","menu", "meter", "nav",
  "noscript", "object", "ol", "optgroup", "option", "output",
  "p", "pre", "progress", "q", "ruby", "rp","rt", "s", "samp",
  "script", "section", "select", "small", "span", "strong",
  "style", "sub", "summary", "sup", "table", "tbody", "td",
  "textarea", "tfoot", "th", "thead", "time", "title", "tr",
  "u", "ul", "var", "video"
)

void_tags <- c("area", "base", "br", "col", "command", "embed",
  "hr", "img", "input", "keygen", "link", "meta", "param",
  "source", "track", "wbr"
)

If you look at this list carefully, you’ll see there are quite a few tags that have the same name as base R functions (body, col, q, source, sub, summary, table). This means we don’t want to make all the functions available by default, either in the global environment or in a package. Instead, we’ll put them in a list (like in Section 10.5) and then provide a helper to make it easy to use them when desired. First, we make a named list containing all the tag functions:

html_tags <- c(
  tags %>% set_names() %>% map(tag),
  void_tags %>% set_names() %>% map(void_tag)
)

This gives us an explicit (but verbose) way to create HTML:

html_tags$p(
  "Some text. ",
  html_tags$b(html_tags$i("some bold italic text")),
  class = "mypara"
)
#> <HTML> <p class='mypara'>Some text. <b><i>some bold italic
#> text</i></b></p>
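Access by name, as in html_tags$p, works because set_names() with no further arguments names each element of the vector after itself, and map() preserves those names. A minimal sketch, using a hypothetical open_tag() function in place of tag():

```r
library(purrr)

# Hypothetical stand-in for tag(): returns the opening tag as a string.
open_tag <- function(tag) paste0("<", tag, ">")

unnamed <- c("p", "b") %>% map(open_tag)
named <- c("p", "b") %>% set_names() %>% map(open_tag)

names(unnamed)
#> NULL
names(named)
#> [1] "p" "b"
named$p
#> [1] "<p>"
```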

We can then finish off our HTML DSL with a function that allows us to evaluate code in the context of that list. Here we slightly abuse the data mask, passing it a list of functions rather than a data frame. This is a quick hack to mingle the execution environment of code with the functions in html_tags.

with_html <- function(code) {
  code <- enquo(code)
  eval_tidy(code, html_tags)
}

This gives us a succinct API which allows us to write HTML when we need it but doesn’t clutter up the namespace when we don’t.

with_html(
  body(
    h1("A heading", id = "first"),
    p("Some text &", b("some bold text.")),
    img(src = "myimg.png", width = 100, height = 100)
  )
)
#> <HTML> <body><h1 id='first'>A heading</h1><p>Some text &amp;<b>some
#> bold text.</b></p><img src='myimg.png' width='100' height='100'
#> /></body>

If you want to access the R function overridden by an HTML tag with the same name inside with_html(), you can use the full package::function specification.
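The sketch below shows the idea in isolation. To stay self-contained it uses a hypothetical one-element list, mask, in place of html_tags, with a dummy function standing in for the <sub> tag function:

```r
library(rlang)

# Hypothetical stand-in for html_tags with a single masking function.
mask <- list(sub = function(...) "<sub> tag function")

# The unqualified name finds the function in the mask ...
eval_tidy(quo(sub("a", "b", "abc")), mask)
#> [1] "<sub> tag function"

# ... while the qualified name still reaches the base R function.
eval_tidy(quo(base::sub("a", "b", "abc")), mask)
#> [1] "bbc"
```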

21.2.6 Exercises

The escaping rules for <script> tags are different because they contain JavaScript, not HTML. Instead of escaping angle brackets or ampersands, you need to escape </script> so that the tag isn’t closed too early. For example, script("'</script>'") shouldn’t generate this:
```
<script>'</script>'</script>
```
But this:
```
<script>'<\/script>'</script>
```
Adapt escape() to follow these rules when a new argument script is set to TRUE.

The use of ... for all functions has some big downsides. There’s no input validation and there will be little information in the documentation or autocomplete about how they are used in the function. Create a new function that, when given a named list of tags and their attribute names (like below), creates tag functions with named arguments.
```
list(
  a = c("href"),
  img = c("src", "width", "height")
)
```
All tags should get class and id attributes.

Reason about the following code that calls with_html() referencing objects from the environment. Will it work or fail? Why? Run the code to verify your predictions.
```
greeting <- "Hello!"
with_html(p(greeting))

p <- function() "p"
address <- "123 anywhere street"
with_html(p(address))
```

Currently the HTML doesn’t look terribly pretty, and it’s hard to see the structure. How could you adapt tag() to do indenting and formatting? (You may need to do some research into block and inline tags.)