TapirMD (Tapir's Markdown) Format Specification

The specification is written in TapirMD (source is available here).

TapirMD is a powerful, next-generation markup language that simplifies content creation. It builds on Markdown's straightforward syntax, offering enhanced specificity and greater control over formatting ^[1].

While inspired by Markdown, TapirMD is not directly compatible. It's designed to generate rich HTML content, including interactive UI elements like tabs and accordion panels. These elements can be implemented using pure HTML and CSS, eliminating the need for JavaScript ^[2].

TapirMD's syntax is both human-readable and machine-parsable, making it a flexible and efficient tool for content creation.

The recommended file extension for TapirMD documents is .tmd.


Terminologies, Rules and Semantics

Character and character sequences

A line end in a TapirMD document is defined as one of the following:

A character sequence consisting of
- An ASCII Carriage Return character (Unicode: U+000D), followed by
- An ASCII Line Feed character (Unicode: U+000A).
A single ASCII Line Feed character that doesn't follow an ASCII Carriage Return character.
The end of the document if it doesn't end with either of the above two cases.
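The three line-end cases above can be sketched in Python as follows. This is only an illustrative sketch, not the reference parser; the name `split_lines` is hypothetical.

```python
import re

# Either text up to an LF (a CR directly before the LF belongs to the line
# end), or a final run of text that is closed by the end of the document.
_LINE_RE = re.compile(r"[^\n]*\n|[^\n]+")


def split_lines(doc: str) -> list[tuple[str, str]]:
    """Split a document into (text, line_end) pairs."""
    lines = []
    for raw in _LINE_RE.findall(doc):
        if raw.endswith("\r\n"):
            lines.append((raw[:-2], "\r\n"))
        elif raw.endswith("\n"):
            lines.append((raw[:-1], "\n"))
        else:
            lines.append((raw, ""))  # closed by the end of the document
    return lines
```

Note that a Carriage Return that is not followed by a Line Feed stays in the line text, since it is not a line end under the definition above.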

A whitespace character is defined as any of the following:

An ASCII Space character (Unicode: U+0020).
An ASCII Horizontal Tab character (Unicode: U+0009).
A CJK Space character (Unicode: U+3000).

A blank character is defined as any of the following:

The ASCII DEL character (Unicode: U+007F).
Any ASCII character with a Unicode value in the inclusive range U+0000 to U+0020.

Most blank characters are invisible in popular text editing software.

A blank character sequence is defined as a sequence of blank characters and can contain at most one line end. If it contains a line end, it must end with the line end.

A perceivable blank character sequence is defined as a blank character sequence that satisfies at least one of the following conditions:

it contains at least one whitespace character.
it ends with a line end.
Perceivable blank character sequences of this case are specifically called line-end blank character sequences.
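The character classes above can be expressed as small predicates. The following Python sketch reads the definitions literally; the function names are hypothetical, the end-of-document line-end case is not modeled, and treating an empty string as "not a sequence" is an assumption.

```python
WHITESPACE = {"\u0020", "\u0009", "\u3000"}  # Space, Horizontal Tab, CJK Space


def is_whitespace(ch: str) -> bool:
    return ch in WHITESPACE


def is_blank(ch: str) -> bool:
    # DEL, or any character in the inclusive range U+0000 to U+0020.
    return ch == "\u007f" or "\u0000" <= ch <= "\u0020"


def is_blank_sequence(s: str) -> bool:
    if not s:
        return False  # assumption: an empty string is not a sequence
    if s.endswith("\r\n"):
        body = s[:-2]
    elif s.endswith("\n"):
        body = s[:-1]
    else:
        body = s
    # At most one line end, and only at the very end.
    return "\n" not in body and all(is_blank(ch) for ch in body)


def is_perceivable_blank_sequence(s: str) -> bool:
    if not is_blank_sequence(s):
        return False
    return s.endswith("\n") or any(is_whitespace(ch) for ch in s)
```

Read literally, the CJK Space character is a whitespace character but not a blank character, and the sketch follows that reading.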

Blocks, lines, and tokens

TapirMD uses ASCII punctuation characters as mark characters.

Each TapirMD document is a plain text file that intermixes content, marks, and blanks.

After parsing, a TapirMD document is composed of a sequence of various blocks, which form a hierarchical structure.

Each block consists of one or more lines.

Each line ends with a line-end blank character sequence.

TapirMD documents are parsed line by line. The TapirMD format is carefully designed to allow each document to be parsed in a single pass.

After parsing, each line is divided into one or more tokens (text segments), such as content tokens, mark tokens and blank tokens. Tokens cannot cross lines.

Each blank token represents a sequence of blank characters.

Specifically, if the sequence of blank characters is perceivable, the corresponding blank token is called a perceivable blank token.
More specifically, if the sequence of blank characters ends with a line-end blank character sequence, the corresponding blank token is called a line-end blank token.

A line

Always ends with a line-end blank token.
May begin with an optional blank token.
Never contains consecutive blank tokens. If consecutive blank tokens do exist within a line, they are merged into a single blank token.

Each mark token consists of one or more punctuation mark characters and an optional blank character sequence which is either before or after the punctuation mark characters.

Generally, a content token contains visible characters (including whitespace characters), but it may also contain invisible blank characters.

Overview of block types

TapirMD supports a variety of block types, categorized into three groups:

atom blocks, including
- blank blocks
- usual blocks
- header blocks
- link (definition) blocks
- separator blocks
- attribute blocks
- code blocks
- custom (data) blocks
Atom blocks are the most basic block type. They cannot nest other blocks, and they can be directly nested within both base blocks and predefined container blocks (with the exception that blank blocks can't be directly nested in non-item predefined container blocks).
predefined container blocks
- (list) item blocks
- table blocks
- quotation blocks
- notice blocks
- reveal blocks
- plain blocks
Except for item blocks, predefined container blocks can only directly nest base or non-blank atom blocks and must be directly nested within a base block.

Item blocks can directly nest base, any atom, and other item blocks. More details of item blocks will be described below.
base (container) blocks, including
- explicit base blocks
- doc blocks
Base blocks have a dual role, functioning as both atom blocks and container blocks. They can directly nest any block type. They can be directly nested within predefined container blocks and other base blocks.

The root block of a TapirMD document is always a doc block. TapirMD might support document nesting later, so that doc blocks may also be nested.

Explicit base blocks are bounded by explicit open and close lines, while doc blocks encompass all document lines.

Base blocks can be nested within one another. Every block, except for the root doc block, has a parent base block, which is the innermost base block that contains the block. This parent base block may or may not be the block's direct parent.

TapirMD supports list nesting within a parent base block. Within a parent base block,

item predefined container blocks can directly nest not only atom and base blocks, but also item predefined container blocks at higher levels.
first-level item predefined container blocks must be directly nested within their parent base block.
non-first-level item predefined container blocks must be directly nested within another item predefined container block at a lower level.

Data lines and syntaxable lines

Code and custom data blocks are explicitly defined by start and end boundary lines. The lines between the boundary lines of a custom data block are referred to as data lines. Similarly, the lines between the boundary lines of a code block are called code lines. In essence, code blocks can be viewed as a type of data block, making code lines a subset of data lines.

Data lines are guaranteed to contain no TapirMD marks. Non-data lines, which may or may not contain TapirMD marks, are referred to as syntaxable lines.

Line ends of data lines are always viewed as a single ASCII Line Feed character, even if they are not.

The line-end blank token and the optional start blank token of syntaxable lines are both ignored in the HTML output, meaning that indentations in TapirMD have no semantic meaning.

Blank blocks

A syntaxable line that consists of only one token (the line-end blank token) is called a blank line.

A sequence of consecutive blank lines forms a blank block.

Blank blocks should be rendered as bare <p> elements in the HTML output.

Start and end of non-item predefined container blocks

During the line-by-line parsing process, a non-item predefined container block starts at a syntaxable line which begins with

an optional blank token followed by
a predefined-container-leading mark token which
- begins with a one-character mark and
  - ends with a perceivable blank character sequence or
  - is followed by a line-end blank token.

The character in the leading mark token is

# for table blocks,
> for quotation blocks,
! for notice blocks,
? for reveal blocks,
. for plain blocks.
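As an illustration of the start rule and the mark table above, the following Python sketch classifies a single syntaxable line. It is a simplified sketch rather than the reference behavior: `container_type` and the type names in `CONTAINER_MARKS` are hypothetical, and the line is examined in isolation, without the surrounding parser state.

```python
# Blank characters: U+0000 to U+0020 plus DEL.
BLANK_CHARS = "".join(chr(c) for c in range(0x21)) + "\x7f"

CONTAINER_MARKS = {
    "#": "table",
    ">": "quotation",
    "!": "notice",
    "?": "reveal",
    ".": "plain",
}


def container_type(line: str):
    """Return the container type started by `line`, or None."""
    # Drop the line end and the optional leading blank token.
    text = line.rstrip("\r\n").lstrip(BLANK_CHARS)
    if not text or text[0] not in CONTAINER_MARKS:
        return None
    rest = text[1:]
    # The run of blank characters that directly follows the mark character.
    run = rest[: len(rest) - len(rest.lstrip(BLANK_CHARS))]
    at_line_end = len(run) == len(rest)       # followed by a line-end blank token
    perceivable = " " in run or "\t" in run   # contains a whitespace character
    return CONTAINER_MARKS[text[0]] if (at_line_end or perceivable) else None
```

For example, `container_type("> quoted text\n")` returns `"quotation"`, while a line starting with `###` is not treated as a table start, because the first `#` is not followed by a perceivable blank character sequence.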

During line-by-line parsing, a non-item predefined container block will end

before a blank block, or
before a following predefined container block (of either item type or not), or
before its parent base block explicitly closes, or
at the end of the containing document.

A non-item predefined container block is always directly nested within its parent base block.

Start and end of item predefined container blocks

For simplicity, we will refer to item predefined container blocks as item blocks from here on.

Item blocks share some common rules with other predefined container blocks. However, since TapirMD supports list nesting in a parent base block, the rules for item blocks are somewhat more complex.

During line-by-line parsing, an item block starts at a syntaxable line which begins with

an optional blank token followed by
a one-character or two-character item predefined-container-leading mark token which
- ends with a perceivable blank character sequence, or
- is followed by a line-end blank token.

The character or character sequence in the leading mark token may be

*, +, -, ~ for unordered lists, and
*., +., -., ~. for ordered lists, and
:, :. for definition lists.

A sequence of consecutive sibling item blocks forms a list. All the item blocks in a list must share the same leading mark. This shared leading mark is called the mark of the list.

A list opens when the first item block starts and closes when its last item block ends.

TapirMD supports list nesting within a parent base block. During line-by-line parsing, the parser tracks opening nested lists within each parent base block. Lists opened earlier have lower levels than those opened later. Higher-level lists are nested inside lower-level lists.

When an item block starts,

if it is found that its leading mark is the same as the mark of an opening list in its parent base block, then the item block is viewed as an item in the opening list. And the item block is viewed as the sibling block of the seen last item block in the opening list.

If the opening list has higher-level lists nested inside it, then all of those higher-level lists close.

If the previous block of the item block is a blank block, then the blank block will be viewed as a direct child of the seen last item block in the opening list.

The item block now is treated as the seen last item block in the opening list.
if its leading mark differs from the marks of all opening lists within the parent base block, then a new list with a higher level opens and the item block is treated as the first and the seen last item block in the new opening list.

If the new opening list is the only opening list tracked by the parser within the parent base block, then the list is called a first-level list. The item blocks of a first-level list are all direct children of the parent base block.

All opening lists will close

before a blank block followed by a non-item block, or
before a non-item predefined container block, or
before the parent base block explicitly closes, or
at the end of the containing document.

When a list closes, its seen last item block ends. The item block is confirmed as the last item block of the list.
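The list-tracking rules above amount to maintaining a stack of opening lists per parent base block. The following Python sketch shows only the level bookkeeping; `ListTracker` and its methods are hypothetical names, and the handling of blank blocks between items is not modeled.

```python
class ListTracker:
    """Tracks the opening lists of one parent base block."""

    def __init__(self):
        # Marks of the opening lists, lowest level first.
        self.open_marks = []

    def start_item(self, mark: str) -> int:
        """Register a new item block and return its 1-based list level."""
        if mark in self.open_marks:
            # The item joins an opening list; lists nested deeper close.
            level = self.open_marks.index(mark) + 1
            del self.open_marks[level:]
        else:
            # No opening list has this mark: a new, higher-level list opens.
            self.open_marks.append(mark)
            level = len(self.open_marks)
        return level

    def close_all(self):
        """Close all opening lists, e.g. before a non-item container block."""
        self.open_marks.clear()
```

With the mark sequence `*`, `-`, `-`, `*`, `+.`, the sketch yields levels 1, 2, 2, 1, 2: the second `*` item returns to the first-level list and closes the `-` list, so `+.` opens a new second-level list.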

About child blocks of base and predefined container blocks

Every predefined container block (of either item type or not) directly nests at least one atom or base block (i.e., it has at least one child).

A base block may have no children.

Base or non-blank atom blocks may open or start on the same lines as predefined container blocks.

If the remaining part of the start line of a predefined container block after the leading mark token has the characteristics of a base block open line or an atom block start line, then a base block or atom block opens or starts on the same line as the predefined container block.
Otherwise, a usual block starts on the same line as the predefined container block, even if the start line of the usual block contains nothing.

During line-by-line parsing, when an atom starts or a base block opens,

if the last block is a blank block, then the atom block or base block, along with that blank block, is treated as a direct child of the parent base block.
if the last block is a non-blank atom block, then the atom block or base block shares the same parent block (either a base block or a predefined container block) with that non-blank atom block.
if the last block is a predefined container block and no children have been detected for the predefined container, then the atom block or base block is treated as the (first) direct child of the predefined container block.
similarly, if the last block is an opening base block and no children have been detected for the opening base block, then the atom block or base block is treated as the (first) direct child of the opening base block.

In the following sections, the rule descriptions for opening base blocks and starting atom blocks all ignore leading mark tokens of predefined container blocks.

Start and end of explicit base blocks

During line-by-line parsing, an explicit base block

opens at a syntaxable line beginning with a base-open-leading mark token, which is a character sequence containing one or more consecutive { characters.
closes at
- a syntaxable line beginning with one or more consecutive } characters (a base-close-leading mark token), or
- at the end of the containing document.

The number of } characters in the base-close-leading mark token and the number of { characters in the base-open-leading mark token are not required to match.

On the open line of an explicit base block, multiple optional attribute tokens may follow the base-open-leading mark token, to set some attributes for the explicit base block. The optional tokens are separated by perceivable blank tokens, and they must be in the following order (from top to bottom) if present:

//
<< >> >< <>
^^
..N:M ..N :M

Here,

// means the explicit base block is commented out and will not be rendered in HTML output. However, the contents of the explicit base block will still be parsed.
<< >> >< <> are four horizontal text alignment tokens. At most one of them can be present.
- << means left-aligned,
- >> means right-aligned,
- >< means center-aligned,
- <> means justify-aligned.
The text alignment tokens define the text alignment of the explicit base block.
^^ is a vertical text alignment token. It is only meaningful when the explicit base block is used as a table cell. It means the table cell is vertically top-aligned. By default, table cells are vertically middle-aligned.
..N:M ..N :M are three table cell span count tokens. At most one of them can be present. They are only meaningful when the explicit base block is used as a table cell. N and M denote positive integers.
- ..N means N cells span along the major axis of the innermost containing table.
- :M means M cells span along the minor axis of the innermost containing table.

A TapirMD parser should try to parse as many attribute tokens as possible. Any remaining unparsed text is ignored.
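The ordered, best-effort attribute parsing described above could look like the following Python sketch. It is an illustration under stated assumptions, not the reference parser: `parse_base_open_line` and the dictionary keys are hypothetical, tokens are split with `str.split()` as an approximation of perceivable blank tokens, and a blank is assumed between the `{` characters and the first attribute token.

```python
import re

ALIGNMENTS = {"<<": "left", ">>": "right", "><": "center", "<>": "justify"}
# Matches ..N:M, ..N, or :M (N and M are decimal digit runs).
SPAN_RE = re.compile(r"(?:\.\.([0-9]+))?(?::([0-9]+))?")


def parse_base_open_line(line: str):
    """Return the attributes of a base-open line, or None for other lines."""
    text = line.strip()
    if not text.startswith("{"):
        return None
    tokens = text.lstrip("{").split()
    attrs = {"commented_out": False, "align": None, "top_aligned": False,
             "span_major": None, "span_minor": None}
    i = 0  # tokens are consumed strictly in the documented order
    if i < len(tokens) and tokens[i] == "//":
        attrs["commented_out"] = True
        i += 1
    if i < len(tokens) and tokens[i] in ALIGNMENTS:
        attrs["align"] = ALIGNMENTS[tokens[i]]
        i += 1
    if i < len(tokens) and tokens[i] == "^^":
        attrs["top_aligned"] = True
        i += 1
    if i < len(tokens):
        m = SPAN_RE.fullmatch(tokens[i])
        if m:
            major, minor = m.group(1), m.group(2)
            if major and int(major) > 0:
                attrs["span_major"] = int(major)
            if minor and int(minor) > 0:
                attrs["span_minor"] = int(minor)
    # Anything left after the last recognized token is ignored.
    return attrs
```

Because tokens are consumed in order, a line such as `{{ ^^ <<` sets only the vertical alignment; the out-of-order `<<` falls into the ignored remainder.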

Currently, the text after the base-close-leading mark token on the close line of an explicit base block is ignored.

Base blocks should be rendered as <div> elements in HTML output.

Attribute blocks

A syntaxable line is an attribute line if it begins with an attribute line leading mark token, which

begins with three or more consecutive @ characters
and ends with an optional blank character sequence.

A sequence of consecutive attribute lines forms an attribute block.

On an attribute line, multiple optional attribute tokens may follow the attribute line leading mark token, to set some attributes for the next sibling block of the containing attribute block, if the next sibling block exists. The optional tokens are separated by perceivable blank tokens, and they must be in the following order (from top to bottom) if present:

#id
.class1;class2

Here,

#id specifies a block ID (id can be any valid HTML4 ID identifier).
.class1;class2 specifies some classes (class1 and class2 can be any valid HTML4 class name identifiers).

A TapirMD parser should try to parse as many attribute tokens as possible. Any remaining unparsed text is ignored.

Warning!

The token format for multiple class names might change.

If an attribute is defined more than once in multiple lines in an attribute block, the first definition is chosen.
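The attribute line format and the first-definition-wins rule can be sketched in Python as follows. The function names are hypothetical, tokens are split with `str.split()` as an approximation, and the sketch does not validate that IDs and class names are valid HTML4 identifiers.

```python
def parse_attribute_line(line: str):
    """Return the attributes set by one attribute line, or None."""
    text = line.strip()
    mark_len = len(text) - len(text.lstrip("@"))
    if mark_len < 3:
        return None  # not an attribute line
    tokens = text[mark_len:].split()
    attrs = {}
    i = 0  # tokens must appear in the order: #id, then .classes
    if i < len(tokens) and tokens[i].startswith("#") and len(tokens[i]) > 1:
        attrs["id"] = tokens[i][1:]
        i += 1
    if i < len(tokens) and tokens[i].startswith(".") and len(tokens[i]) > 1:
        attrs["classes"] = [c for c in tokens[i][1:].split(";") if c]
    return attrs


def merge_attribute_block(lines):
    """Merge consecutive attribute lines; the first definition wins."""
    merged = {}
    for line in lines:
        attrs = parse_attribute_line(line)
        if attrs is None:
            break  # the attribute block ends at the first non-attribute line
        for key, value in attrs.items():
            merged.setdefault(key, value)
    return merged
```

For instance, merging `@@@ #intro .note;wide` and `@@@ #other` keeps `intro` as the ID, since it was defined first.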

If an attribute block has no next sibling block but has a previous sibling block, then the previous sibling block will be wrapped in an (implicit) footer block, and the attributes defined in the attribute block are set on the footer block.

ToDo:

If an attribute block has no sibling blocks, then the attributes defined in the block are for the containing document. Such attribute blocks should be placed at the beginning of the document.

Class attributes are purely an HTML matter, but the ID attributes of blocks are used in TapirMD for various purposes.

Usual blocks

A syntaxable line is a usual line if it begins with a usual block leading mark token, which

begins with three or more consecutive ; characters
and ends with an optional blank character sequence.

A new usual block will always start at such a usual line. If the usual block leading mark token is followed by a line-end blank token, then the line is rendered as a blank block in HTML output.

If a syntaxable line doesn't begin with any identifiable block leading tokens, the line is also treated as a usual line, called a plain usual line. For a plain usual line,

if it is the first line of a predefined container block, a new usual block starts at the plain usual line.
if it has a previous line and the previous line is a block boundary line or a blank line, a new usual block starts at the plain usual line.
if it has a previous line and the previous line is a usual/header/link line, then the two lines belong to the same atom block (which might be a usual/header/link block).
otherwise, a new usual block starts at the plain usual line.

Note that a usual block without non-blank tokens has alternative semantics when it is the first child block of a table block.

Usual blocks should be rendered as <div> elements in HTML output.

Header blocks

A syntaxable line beginning with three consecutive # characters is a header line. If the three # characters are followed by

one or more consecutive = characters, a second-level header block starts at the header line, or
one or more consecutive + characters, a third-level header block starts at the header line, or
one or more consecutive - characters, a fourth-level header block starts at the header line, or
zero or more consecutive # characters, a first-level header block starts at the header line.

A header block leading mark token

begins with such a leading character sequence consisting of #, =, + and - characters,
and ends with an optional blank character sequence.
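The header level rules above can be condensed into one regular expression. The following Python sketch is illustrative only; `header_level` is a hypothetical name, and leading blanks are removed with `str.lstrip()` as an approximation of the optional leading blank token.

```python
import re

# Three # characters, then =+ (level 2), ++ (level 3), -+ (level 4),
# or zero or more further # characters (level 1).
HEADER_RE = re.compile(r"###(=+|\++|-+|#*)")
LEVELS = {"=": 2, "+": 3, "-": 4}


def header_level(line: str):
    """Return the header level (1 to 4) started by `line`, or None."""
    m = HEADER_RE.match(line.lstrip())
    if m is None:
        return None
    tail = m.group(1)
    # An empty tail or a tail of # characters means a first-level header.
    return LEVELS.get(tail[:1], 1)
```

For example, `header_level("###==== Section")` returns 2, and `header_level("## not a header")` returns `None` because fewer than three # characters lead the line.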

Multiple optional plain usual lines can follow a header line and also belong to the same header block starting at the header line.

A header block with only one non-blank token (its header block leading mark token) is called a bare header block.

First-level non-bare header blocks are generally used for document titles. When there is more than one first-level non-bare header block in a TapirMD document and no external title is provided, the first one is used as the document title block and the others are treated as section titles. In HTML output, the font size of the title block should be larger than that of section titles.

A bare header block is rendered as a TOC (table of contents) block in HTML output. Generally, a TapirMD document should contain only one bare header block. An Nth-level bare header block implies that all section titles from level one to level N (inclusive) will be listed in the TOC.

The section titles contained in predefined container blocks will never be listed in TOC.

Note that first-level header blocks have different semantics when they are the first non-attribute children of predefined container blocks.

Style and controlling marks

Besides the block leading mark token (if it exists), each line within a usual/header/link block may contain all kinds of style and controlling mark tokens. These mark tokens can help content creators achieve text styling, hyperlinks, media showing, line spacing, mark character escaping, etc.

The usual lines in header and usual blocks may contain various style and formatting tokens. These tokens enable content creators to apply a wide range of effects, such as:

text styling, including
- bold and dimmed
- italic and revert-italic
- underline and dotted underline
- strikethrough and text hiding
- smaller and larger font size
- subscript and superscript
- text marking
- code spans and mono-font spans
hyperlinks
media embedding
line comments
line breaks
line-end spacing (whether or not a space character is generated between two neighboring lines)
(mark) character escaping

There are two groups of style and control mark tokens: line-leading mark tokens and non-line-leading mark tokens.

line-leading mark tokens

A line-leading mark token must appear at the beginning of a line to take effect. All line-leading mark tokens

begin with exactly two identical characters
and end with a perceivable blank token.

Here is the list of all line-leading mark tokens supported now.

Token Types	Leading Characters	Explanation
mark-escaping token	`!!`	Within the containing line (called a mark-escaped line), the text following the perceivable blank token is guaranteed to not contain other mark tokens.
spoiler token	`??`	Within the containing line (called a spoiler line), the text following the perceivable blank token is hidden in the generated HTML. Note that the text is also mark-escaped. The text is intended for spoiler purposes, not for security purposes such as storing passwords. The text should be initially invisible in browsers and may become visible after specific user interactions, such as selection.
media-embedding token	`&&`	Within the containing line (called a media-embedding line), the text following the perceivable blank token is also mark-escaped. Currently, the text must be a valid image URI, whether relative or absolute. Note: If the media-embedding line is not the only content in the containing block, the specified media should be displayed using these CSS properties: `height: 1em; vertical-align: middle;`. A text is a valid image URI if it ends with one of the following extensions (ignoring case): `.png` `.gif` `.jpg` `.jpeg`. NOTE: The image URI validation rules might be adjusted with more details later.
line-break token	`\\`	A line-break token which is equivalent to `<br>` in HTML.
line-comment token	`//`	Within the containing line (called a comment line), the text following the perceivable blank token is mark-escaped unless it exhibits the characteristics of a link definition. Link definitions are specified in a following section. Comment lines don't contain content tokens.

Line-leading mark tokens take higher precedence over all non-line-leading mark tokens.

even-backtick mark tokens

Even-backtick mark tokens, just as the name implies, comprise an even number of backtick (`) characters.

Even-backtick mark tokens can operate in a secondary mode. In secondary mode, an even-backtick mark token begins with an additional ^ (caret) character.

Even-backtick mark tokens are used to denote various special characters or character sequences.

An even-backtick mark token in primary mode with exactly one pair of backticks is treated as a void character and rendered as nothing in HTML output.
An even-backtick mark token in primary mode with more than one pair of backticks is treated as a non-collapsible space sequence. The number of non-collapsible spaces in the sequence is the pair count minus one.
An even-backtick mark token in secondary mode is treated as a backtick character sequence, with the number of backticks in the sequence equal to the pair count.
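The three cases above can be sketched as one small Python function. This is an illustration, not the reference renderer: `render_even_backticks` is a hypothetical name, and rendering a non-collapsible space as a no-break space (U+00A0) is an assumption, since the text above does not name the output character.

```python
NBSP = "\u00a0"  # assumption: stands in for one non-collapsible space


def render_even_backticks(token: str) -> str:
    """Render one even-backtick mark token (with an optional leading ^)."""
    secondary = token.startswith("^")
    ticks = token[1:] if secondary else token
    if not ticks or len(ticks) % 2 != 0 or set(ticks) != {"`"}:
        raise ValueError("not an even-backtick mark token")
    pairs = len(ticks) // 2
    if secondary:
        return "`" * pairs       # literal backticks, one per pair
    return NBSP * (pairs - 1)    # one pair renders as nothing (void)
```

Under these assumptions, a token of six backticks (three pairs) in primary mode renders as two spaces, and `^` followed by four backticks renders as two literal backticks.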

Even-backtick mark tokens take higher precedence over other non-line-leading mark tokens. The next section will talk more about this rule.

Below, we refer to the other non-line-leading mark tokens as style mark tokens.

style mark tokens

Each style mark token type is associated with a specified ASCII punctuation character. The character is called the mark character of that style type.

Style mark tokens have opening and closing semantics. In a usual or header block, the odd-numbered occurrences of a style type are treated as opening style mark tokens, while the even-numbered occurrences are treated as closing style mark tokens. The mark character count in a closing style mark token must match the previous opening style mark token of the same type.

Similar to even-backtick mark tokens, opening style mark tokens can also operate in a secondary mode. In secondary mode, opening style mark tokens also begin with an additional ^ character.

All style types are listed in the following table.

Style Type	Mark Character	Primary Mode Semantic	Secondary Mode Semantic
font-face	`	code span	mono font
font-weight	`*`	bold	dimmed
font-style	`%`	italic	revert italic
font-size	`:`	smaller	larger
text-deletion	`~`	strikethrough	hide (but still occupy space)
text-marking	`|`	highlight	highlight (with mistake smell)
sub/sup	`$`	subscript	superscript
link/underline	`_`	link	underline

Mark tokens of the font-face style type are required to have exactly one mark character (`), while mark tokens of other style types are required to have a mark character count in the inclusive range [2, 7].

An opening style mark token may end with a non-line-end blank character sequence. A closing style mark token may begin with a blank character sequence.

Style mark tokens function as style toggle switches. Within a usual or header block, an opening mark token of a specific style type activates that style. The style is deactivated when either the corresponding closing mark token is encountered or the end of the block is reached. Before deactivation, additional style mark tokens of the same type are ignored (escaped) if their character count does not match the opening mark token, ensuring they do not deactivate the style prematurely. All content tokens between when the style is activated and when it is deactivated form a text span of the style.

Style mark tokens in the primary mode (code span) of the font-face type take precedence over other style mark tokens. This means that, within a usual or header block, when the code style is activated, mark tokens of other style types are temporarily ignored (escaped) until the code style is deactivated.

The previous section mentioned that even-backtick mark tokens take higher precedence over other non-line-leading mark tokens. What does this rule mean? It means:

A sequence of backticks with an even number of characters will be interpreted as an even-backtick mark token.
A sequence of backticks with an odd number of characters will be interpreted as an even-backtick mark token followed by a code span mark token.
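The precedence rule above can be sketched as a split of a backtick run into tokens. In this Python sketch, `split_backtick_run` and the token labels are hypothetical, and a single backtick is assumed to yield only a code span mark token (the even-backtick part would be empty).

```python
def split_backtick_run(run: str):
    """Split a run of backticks into (kind, text) mark tokens."""
    if not run or set(run) != {"`"}:
        raise ValueError("expected a non-empty run of backticks")
    n = len(run)
    even = n - (n % 2)  # the longest even-length prefix
    tokens = []
    if even:
        tokens.append(("even-backtick", "`" * even))
    if n % 2:
        tokens.append(("code-span", "`"))
    return tokens
```

For a run of five backticks, the sketch produces a four-backtick even-backtick mark token followed by one code span mark token.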

Due to the rules outlined above in TapirMD, text spans with different styles may intersect. When generating HTML, some text spans may need to be split into smaller pieces. However, TapirMD is carefully designed to ensure that text spans with link or code styles never need to be split apart.

links and footnotes

Below, we will refer to text spans with link style as link spans (or just links).

A non-empty link span will be rendered as a hyperlink in HTML output.

If the link span contains only one content token, that token is used as the text of the hyperlink.
If the link span contains multiple content tokens, all tokens except the last one are used as the link text of the hyperlink.

The URL for a hyperlink can be defined either inside or outside the corresponding link span.

If the last content token of a link span is a valid URL (see below), it is used as the hyperlink's URL. We call the link self-defined.
Otherwise, a matching link (definition) block will be searched to provide the URL (the matching rules are described below).
- If a match is found, then the hyperlink's URL is determined by the matched link (definition) block.
  - If the matched link (definition) block succeeds in specifying a URL, that URL is used as the hyperlink's URL.
  - Otherwise, the hyperlink is viewed as broken.
- If no matching blocks are found, the last content token in the corresponding link span is treated as the argument of a custom URL generator.
  - If the URL is successfully generated, the generated URL is used as the hyperlink's URL.
  - Otherwise, the hyperlink is viewed as broken.

A content token is a valid URL if its text

starts with http:// (ignore case), or
starts with https:// (ignore case), or
ends with .htm[#fragment] (ignore case), or
ends with .html[#fragment] (ignore case), or
is #[fragment].

[...] means an optional part here.
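The URL validity rules above can be sketched as a single predicate. In this Python sketch, `is_valid_url` is a hypothetical name, and a fragment is assumed to be any text after the # character, since the rules above do not restrict it further.

```python
import re

# Text ending with .htm or .html (any case), optionally followed by #fragment.
_PAGE_RE = re.compile(r".*\.html?(#.*)?", re.IGNORECASE | re.DOTALL)


def is_valid_url(text: str) -> bool:
    lower = text.lower()
    if lower.startswith(("http://", "https://")):
        return True
    if text.startswith("#"):
        return True  # "#" alone or "#fragment"
    return _PAGE_RE.fullmatch(text) is not None
```

For example, `docs/page.html#top` and `#note1` are accepted, while `readme.md` is not, so a link span ending with `readme.md` would fall through to link block matching.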

If a hyperlink span only contains a #fragment token, then the hyperlink span is viewed as a footnote reference. The corresponding footnote is defined in the block whose ID is fragment. Generally, footnote definition blocks should be placed in an explicit base block which is commented out. Footnote blocks will always be rendered at the end of the HTML output.

Link (definition) blocks

A syntaxable line beginning with three or more consecutive = characters is a link (definition) line (or simply link line). Each link (definition) block (or simply link block) always starts with such a line.

Multiple optional plain usual lines can follow a link line and also belong to the same link block starting with the link line.

The sequence of the leading consecutive = characters of a link block is called a link block leading mark token.

The lines within a link block never contain link spans. In other words, __ character sequences will be viewed as content characters in link blocks.

A link block with only one non-blank token (its link block leading mark token) is called a bare link block.

Link blocks are used to specify URLs for non-self-defined link spans. Each link block can specify one URL for multiple link spans satisfying certain patterns.

A link block must contain at least two content tokens to specify a URL.

If the last content token of a link block is a valid URL, then the URL is what is specified.
Otherwise, the last content token is treated as the argument of a custom URL generator.
- If the URL is successfully generated, then the URL is what is specified.
- Otherwise, the link block fails to specify a URL.

For a (non-self-defined) link span, the link blocks after it have higher matching priority than those before it. And

for the ones after the link span, earlier ones have priority over later ones. But the ones after a bare link block which is after the link span will never get matched.
for the ones before the link span, later ones have priority over earlier ones. But the ones before a bare link block which is before the link span will never get matched.
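The priority rules above define the order in which link blocks are tried for a link span. The following Python sketch computes that order from document positions; `candidate_order` and the `(position, is_bare)` representation are hypothetical, and bare link blocks are treated purely as barriers that stop the search in their direction.

```python
def candidate_order(blocks, span_pos):
    """Return link block positions in matching-priority order.

    `blocks` is a list of (position, is_bare) pairs in document order;
    `span_pos` is the position of the link span.
    """
    after = []
    for pos, is_bare in blocks:
        if pos > span_pos:
            if is_bare:
                break  # blocks after this bare block never match
            after.append(pos)  # earlier ones first
    before = []
    for pos, is_bare in reversed(blocks):
        if pos < span_pos:
            if is_bare:
                break  # blocks before this bare block never match
            before.append(pos)  # later ones first
    return after + before
```

With link blocks at positions 1, 2 (bare), 3, 10, 11 (bare) and 12, a link span at position 5 tries block 10 first and then block 3; blocks 1 and 12 are shielded by the bare blocks.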

How matching texts are generated:

All content tokens of a link span are combined into a single matching text, with all blank characters removed.
All content tokens of a link block, except the last one, are combined into a single matching text, with all blank characters removed.

How matching works depends on the structure of matching texts of link blocks:

If the matching text of a link block consists solely of three dots (...), then the link block matches all link spans.
If the matching text ends with three dots, prefix matching is performed.
If the matching text begins with three dots, suffix matching is performed.
Otherwise, exact matching is performed.
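The matching text generation and the four matching cases above can be sketched in Python as follows. The function names are hypothetical, the cases are checked in the order listed, and comparison is assumed to be case-sensitive, which the rules above do not state.

```python
# Blank characters: U+0000 to U+0020 plus DEL.
BLANKS = {chr(c) for c in range(0x21)} | {"\x7f"}


def matching_text(content_tokens) -> str:
    """Join content tokens and drop all blank characters."""
    return "".join(ch for tok in content_tokens for ch in tok if ch not in BLANKS)


def block_matches(block_text: str, span_text: str) -> bool:
    """Check whether a link block's matching text matches a link span's."""
    if block_text == "...":
        return True                                    # matches all link spans
    if block_text.endswith("..."):
        return span_text.startswith(block_text[:-3])   # prefix matching
    if block_text.startswith("..."):
        return span_text.endswith(block_text[3:])      # suffix matching
    return block_text == span_text                     # exact matching
```

For a link block, the caller would pass all content tokens except the last one (the URL or generator argument) to `matching_text`; for a link span, all content tokens are passed.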

Line-end spacing rules

A line end in a usual or header block may be ignored or rendered as an ASCII Space character in HTML output.

Line ends of comment lines and media-embedding lines are always ignored in HTML output.

In a usual or header block, for a line which is neither a comment line nor a media-embedding line, its line end is rendered as an ASCII Space character unless any of the following cases happens:

The line has an opening style mark token followed by a line-end blank token.
The line has no content tokens.
(Note: Even-backtick mark tokens are treated as content tokens.)
The last content token in the line ends with a blank or CJK character ^[3].
(Note: Even-backtick mark tokens in primary mode are interpreted as CJK characters.)
Within the block, after the line, no more content tokens are found.
After the line and before the next content token, a media-embedding or line-break token is found.
The next content token begins with a blank or CJK character.
(Again, even-backtick mark tokens in primary mode are interpreted as CJK characters.)
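
The exceptions above can be read as a chain of checks. The sketch below is a simplified illustration, not the reference behavior: the `LineInfo` record and its fields are hypothetical, the CJK test covers only a few common Unicode ranges, and the special handling of even-backtick mark tokens is left out.

```python
from dataclasses import dataclass
from typing import Optional


def is_blank(ch: str) -> bool:
    return ch == "\x7f" or ord(ch) <= 0x20


def is_cjk(ch: str) -> bool:
    # Assumption: a rough approximation of "CJK character".
    cp = ord(ch)
    return (
        0x3000 <= cp <= 0x30FF      # CJK punctuation, Hiragana, Katakana
        or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
        or 0xAC00 <= cp <= 0xD7AF   # Hangul Syllables
    )


@dataclass
class LineInfo:
    ends_with_opening_style_mark: bool   # opening style mark, then line-end blank
    last_content: Optional[str]          # text of the line's last content token
    next_content: Optional[str]          # text of the next content token in the block
    media_or_break_before_next: bool     # media-embedding or line-break token in between


def line_end_renders_space(info: LineInfo) -> bool:
    """Return True if the line end would be rendered as an ASCII Space."""
    if info.ends_with_opening_style_mark:
        return False
    if not info.last_content:
        return False  # the line has no content tokens
    if is_blank(info.last_content[-1]) or is_cjk(info.last_content[-1]):
        return False
    if not info.next_content:
        return False  # no more content tokens in the block
    if info.media_or_break_before_next:
        return False
    if is_blank(info.next_content[0]) or is_cjk(info.next_content[0]):
        return False
    return True
```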

NOTE:

The current line-end spacing rules are not perfect and may be adjusted later. If perfecting them turns out to require overly complex rules, they will intentionally be left imperfect.

Separator blocks

A syntaxable line is a separator line if it begins with a separator-leading mark token, which comprises three or more consecutive - characters followed by a line-end blank token.

Each separator line forms a separator block.

Generally, a separator block should be rendered as a horizontal rule (the <hr> element). However, note that separator blocks directly nested in table blocks have alternative semantics.
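
A small sketch of how a separator line might be recognized. It assumes that any container indentation is stripped first and that only blank characters may follow the run of - characters; the function name is hypothetical.

```python
import re

# Three or more '-' characters, then only blank characters up to the end.
_SEPARATOR = re.compile(r"-{3,}[\x00-\x20\x7f]*\Z")


def is_separator_line(line: str) -> bool:
    """Check a line after dropping leading spaces and tabs (an assumption)."""
    return _SEPARATOR.match(line.lstrip(" \t")) is not None
```

Under these assumptions, `----------` and an indented `-----` are separator lines, while `--` and `--- text` are not.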

Code blocks

During line-by-line parsing, a code block

starts at a syntaxable line beginning with a code-block-leading mark token, which is a character sequence containing one or more consecutive ' (single quotation, not backtick) characters. The line is the start boundary line of the code block.
ends at
- a later syntaxable line (the end boundary line) beginning with a code-block-leading mark token, which contains the same number of ' characters as the corresponding code-block-leading mark token in the start boundary line, or
- the end of the document. In that case, the code block has no end boundary line.
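
The boundary rules can be sketched as a scan over lines. This is an illustration under assumptions: lines are plain strings, leading spaces and tabs are ignored when looking for the mark, every line is treated as syntaxable, and the function names are hypothetical.

```python
def leading_quote_count(line: str, quote: str = "'") -> int:
    """Count the consecutive quote characters at the start of a line."""
    stripped = line.lstrip(" \t")
    return len(stripped) - len(stripped.lstrip(quote))


def find_code_block(lines, start=0, quote="'"):
    """Return (start_index, end_index_or_None, code_lines), or None if no block starts."""
    for i in range(start, len(lines)):
        n = leading_quote_count(lines[i], quote)
        if n == 0:
            continue
        # Look for a later line whose mark has the same number of quotes.
        for j in range(i + 1, len(lines)):
            if leading_quote_count(lines[j], quote) == n:
                return i, j, lines[i + 1:j]
        # No end boundary line: the block runs to the end of the document.
        return i, None, lines[i + 1:]
    return None
```

A line whose mark has a different number of ' characters does not close the block; it stays a code line. Passing `quote='"'` applies the same scan to the custom blocks described later.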

The lines other than the boundary lines in a code block are called code (data) lines. In HTML output, the line ends of code lines are always treated as an ASCII Line Feed character, even if they are not.

The main purpose of code blocks is to show raw text lines, especially programming language code snippets.

On the start boundary line of a code block, multiple optional attribute tokens may follow the code-block-leading mark token to set some attributes for the code block. The optional tokens are separated by perceivable blank tokens, and they must be in the following order (from top to bottom) if present:

//
language

Here,

// means the code block is commented out and will not be rendered in HTML output.
language means a programming language name, such as zig, c, go, etc. HTML renderers may use the language name to add class names for the code block.

A TapirMD parser should try to parse as many attribute tokens as possible. The remaining un-parsed texts are ignored.
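
A minimal sketch of reading the start boundary line. It assumes tokens can be obtained by splitting on whitespace (an approximation of perceivable blank tokens) and uses a hypothetical function name and result shape.

```python
def parse_code_block_start(line: str) -> dict:
    """Read the optional `//` and language tokens after the leading mark."""
    rest = line.lstrip(" \t").lstrip("'")
    tokens = rest.split()

    commented = False
    language = None
    if tokens and tokens[0] == "//":
        commented = True
        tokens = tokens[1:]
    if tokens:
        language = tokens[0]
    # Anything after the recognized tokens is ignored.
    return {"commented": commented, "language": language}
```

Because the tokens must appear in order, a `//` that comes after the language name is treated as un-parsed text and ignored.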

On the end boundary line of a code block, multiple optional tokens may follow the code-block-leading mark token to stream the TapirMD source of another block into the code block. The optional tokens are separated by perceivable blank tokens, and they must be in the following order (from top to bottom) if present:

<<
#id

Here,

<< merely indicates the streaming direction.
#id specifies the block to be streamed.

A TapirMD parser should try to parse as many attribute tokens as possible. The remaining un-parsed texts are ignored.

Both supported tokens must be present for the streaming to be meaningful. The explicit boundary lines of the block to be streamed are excluded from the streamed content.
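
The end boundary line could be read as in the sketch below, which returns the target block ID only when both tokens are present and in order. Splitting on whitespace and the function name are assumptions for illustration.

```python
from typing import Optional


def parse_stream_target(line: str) -> Optional[str]:
    """Return the ID after `<< #id` on an end boundary line, or None."""
    rest = line.lstrip(" \t").lstrip("'")
    tokens = rest.split()
    if (
        len(tokens) >= 2
        and tokens[0] == "<<"
        and tokens[1].startswith("#")
        and len(tokens[1]) > 1
    ):
        return tokens[1][1:]
    return None  # streaming is not meaningful without both tokens
```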

Custom (data) blocks

During line-by-line parsing, a custom block

starts at a syntaxable line beginning with a custom-block-leading mark token, which is a character sequence containing one or more consecutive " (double quotation) characters. The line is the start boundary line of the custom block.
ends at
- a later syntaxable line (the end boundary line) beginning with a custom-block-leading mark token, which contains the same number of " characters as the custom-block-leading mark token in the start boundary line, or
- the end of the document. In that case, the custom block has no end boundary line.

The lines other than the boundary lines in a custom block are called data lines. In HTML output, the line ends of data lines are always treated as an ASCII Line Feed character, even if they are not.

The main purpose of custom blocks is to extend TapirMD by supporting user data blocks.

On the start boundary line of a custom block, multiple optional attribute tokens may follow the custom-block-leading mark token to set some attributes for the custom block. The optional tokens are separated by perceivable blank tokens, and they must be in the following order (from top to bottom) if present:

//
app

Here,

// means the custom block is commented out and will not be rendered in HTML output.
app means an application name. An application might be
- a built-in application, or
- a user plugin.

A TapirMD parser should try to parse as many attribute tokens as possible. The remaining un-parsed texts are ignored.

Currently, user plugins are not supported, and html is the only supported built-in application.

⚠ Warning!

Be careful when using the built-in html application. All the data lines in an html custom block are written as is to HTML output.

Currently, all text after the custom-block-leading mark token in the end boundary line of a custom block is ignored.
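
To illustrate why the warning above matters, here is a sketch of how a renderer might handle a custom block. The function name is hypothetical, data lines are assumed to be passed without their line ends, and producing no output for an unknown application is an assumption, since this specification does not define that case.

```python
def render_custom_block(start_line: str, data_lines) -> str:
    """Render a custom block from its start boundary line and data lines."""
    tokens = start_line.lstrip(" \t").lstrip('"').split()
    if tokens and tokens[0] == "//":
        return ""  # commented out: nothing is rendered
    app = tokens[0] if tokens else None
    if app == "html":
        # Data lines are written as is, each ended with a Line Feed.
        return "".join(line + "\n" for line in data_lines)
    return ""  # assumption: unsupported applications produce no output
```

Nothing in the html branch escapes or sanitizes the data lines, so any markup or script they contain reaches the output unchanged.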

List semantics

If the mark of a list is : or :., then the list is treated as a definition list in HTML output. It is recommended to use two different styles for definition lists beginning with different marks.

For an item block in a definition list,

if the first non-attribute child block of the item block is a first-level header block, then the header block is treated as the definition title, and the other children are treated as the definition body.
otherwise, the definition title is viewed as missing and all the children are treated as the definition body.

(definition list examples)

Render Result

A definition list with the : mark:

Term 1: Descriptions of term 1.
Term 2: Descriptions of term 2.

A definition list with the :. mark:

Term 1: Descriptions of term 1.
Term 2: Descriptions of term 2.

This is an indented block. It is actually a definition item block without title.

TapirMD Source

      A definition list with the `:` mark:
      :  ### Term 1
         ;;; Descriptions of term 1.
      :  ### Term 2
         ;;; Descriptions of term 2.

      A definition list with the `:.` mark:
      :. ### Term 1
         ;;; Descriptions of term 1.
      :. ### Term 2
         ;;; Descriptions of term 2.

      @@@
      :  This is an indented block.
         It is actually a definition item block without title.

If the mark of a list is *, +, - or ~, and the first non-attribute child blocks of all its item blocks are not first-level header blocks, then the list is treated as an unordered list in HTML output.

(an unordered list example)

Render Result

Languages

Zig
- https://ziglang.org
  - Doc: https://ziglang.org/documentation
  - Downloads: https://ziglang.org/download/
- https://github.com/ziglang/zig
- https://ziggit.dev/
Go
- https://go.dev
- https://github.com/golang/go

TapirMD Source

      Languages
      *  Zig
         -  __https://ziglang.org
            ~  Doc: __https://ziglang.org/documentation
            ~  Downloads: __https://ziglang.org/download/
         -  __https://github.com/ziglang/zig
         -  __https://ziggit.dev/
      *  Go
         +  __https://go.dev
         +  __https://github.com/golang/go

If the mark of a list is *., +., -. or ~., and the first non-attribute child blocks of all its item blocks are not first-level header blocks, then the list is treated as an ordered list in HTML output.

(an ordered list example)

Render Result

☐ finish the lib implementation
1. ☑ inline styles
2. ☑ block hierarchy
3. ☐ wasm
  1. Go lib (using wasm)
  2. JS lib (using wasm)
☐ write tests
☑ write specification

TapirMD Source

      *. ☐ finish the lib implementation
         +. ☑ inline styles
         +. ☑ block hierarchy
         +. ☐ wasm
            -. Go lib (using wasm)
            -. JS lib (using wasm)
      *. ☐ write tests
      *. ☑ write specification

If the mark of a list begins with *, +, - or ~, and the first non-attribute child block of at least one item block in the list is a first-level header block, then the list is treated as a tab panel in HTML output.

(a tab panel example)

Render Result

Zig

https://ziglang.org

1. Doc

https://ziglang.org/documentation

2. Downloads

https://ziglang.org/download/
https://github.com/ziglang/zig
https://ziggit.dev/

TapirMD Source

      *  ### Zig
         -  __https://ziglang.org
            ~. ### Doc
               ;;; __https://ziglang.org/documentation
            ~. ### Downloads
               ;;; __https://ziglang.org/download/
         -  __https://github.com/ziglang/zig
         -  __https://ziggit.dev/
      *  ### Go
         +  __https://go.dev
         +  __https://github.com/golang/go
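
The four list rules in this section can be summarized in one classification function. This is a sketch with hypothetical names: `items_start_with_header` holds one boolean per item block, telling whether its first non-attribute child is a first-level header block, and marks are assumed to be a bullet character with an optional trailing dot.

```python
BULLETS = {"*", "+", "-", "~"}


def classify_list(mark: str, items_start_with_header) -> str:
    """Map a list mark and its items to the HTML construct described above."""
    if mark in (":", ":."):
        return "definition list"
    if mark[:1] in BULLETS and mark[1:] in ("", "."):
        if any(items_start_with_header):
            return "tab panel"
        return "ordered list" if mark.endswith(".") else "unordered list"
    return "other"  # marks not covered by this section
```

For example, `classify_list("*", [True, False])` gives `"tab panel"`, while `classify_list("+.", [False, False])` gives `"ordered list"`.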

Table semantics

Like other non-item predefined container blocks, the child blocks of table blocks can be either base blocks or any non-blank atom blocks.

Block Type	Role in Table	Text Alignment	More Explanation
attribute blocks	nothing	N/A	Attribute blocks in table blocks have no table-specific semantics.
separator blocks	delimiters of table rows or columns	N/A	The child blocks in a table block are divided into multiple block groups by separator blocks. Each block group forms a table row or column if it contains at least one table cell block.
usual blocks	table cell or table major axis specifier	center	If the first child block of a table block is a usual block containing only blank tokens, it specifies that the table is column-major. Otherwise, the table is row-major. Other usual blocks are treated as table cell blocks.
header blocks	table cell	center	Specifically, first-level header blocks are treated as table header cells.
code blocks		left
custom blocks		left
base blocks		left by default	Text alignments of explicit base table cell blocks can be configured using attribute tokens on the opening lines of explicit base blocks. Cell spans can be also configured using attribute tokens on the opening lines of explicit base blocks.

The vertical text alignment of table cells is always middle.
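
The grouping described in the table above can be sketched as follows. This is a simplified illustration: cells are plain strings, `SEP` is a hypothetical stand-in for a separator block, attribute blocks and the major-axis specifier are assumed to be removed already, and cell spans are ignored.

```python
SEP = object()  # hypothetical marker standing in for a separator block


def build_table(children, column_major=False):
    """Group cells by separators and return the table as a list of rows."""
    groups, current = [], []
    for child in children:
        if child is SEP:
            groups.append(current)
            current = []
        else:
            current.append(child)
    groups.append(current)
    # Only groups containing at least one cell form a row or column.
    groups = [g for g in groups if g]

    if not column_major:
        return groups  # each group is a row
    # Column-major: each group is a column, so transpose into rows.
    height = max((len(g) for g in groups), default=0)
    return [[g[r] if r < len(g) else None for g in groups] for r in range(height)]
```

With the children `Language`, `Simplicity`, a separator, `Markdown`, `Very simple`, the row-major result has the two groups as rows, while the column-major result pairs `Language` with `Markdown` in the first row and `Simplicity` with `Very simple` in the second.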

(a row-major table example)

Render Result

Language	Simplicity	Readability	Powerful
Markdown	Very simple	Good	No
TapirMD	Reasonably simple	Good	Yes
AsciiDoc	Not very simple	Not very good	Yes

TapirMD Source

      #  ### Language
         ### Simplicity
         ### Readability
         ### Powerful
         ----------
         ;;; Markdown
         ;;; Very simple
         { >< :2
         Good
         }
         ;;; No
         ----------
         ;;; TapirMD
         ;;; Reasonably simple
         { >< :2
         Yes
         }
         ----------
         ;;; AsciiDoc
         ;;; Not very simple
         ;;; Not very good

(a column-major table example)

Render Result

Language	Markdown	TapirMD	AsciiDoc
Simplicity	Very simple	Reasonably simple	Not very simple
Readability	Good		Not very good
Powerful	No	Yes

TapirMD Source

      #
         ### Language
         ### Simplicity
         ### Readability
         ### Powerful
         ----------
         ;;; Markdown
         ;;; Very simple
         { >< :2
         Good
         }
         ;;; No
         ----------
         ;;; TapirMD
         ;;; Reasonably simple
         { >< :2
         Yes
         }
         ----------
         ;;; AsciiDoc
         ;;; Not very simple
         ;;; Not very good

Quotation block semantics

A quotation block can have two different appearances, depending on whether the first non-attribute child block of the quotation block is a first-level header block. These appearances are determined by the TapirMD renderer implementation.

(a quotation block example)

Render Result

"Success is not final, failure is not fatal: It is the courage to continue that counts."

"It is never too late to be what you might have been."

TapirMD Source

      >  "Success is not final, failure is not fatal: It is the courage to continue that counts."
         {
         ;;; -- Winston Churchill
         @@@
         }
         {
         >  "It is never too late to be what you might have been."
            ;;; George Eliot
            @@@
         }

(another quotation block example)

Render Result

The best way to predict the future is to invent it.

TapirMD Source

      >  ### The best way to predict the future is to invent it.

Notice block semantics

A notice block should be rendered prominently.

If the first non-attribute child block of a notice block is a first-level header block, then the header block should be rendered as the header of the notice block.

(a notice block example)

Render Result

WARNING: The specification is not yet stable.

TapirMD Source

      !  WARNING: The specification is not yet stable.

(another notice block example, with header)

Render Result

WARNING!

The specification is not yet stable.

TapirMD Source

      !  ### WARNING!
         ;;; The specification is not yet stable.

Reveal block semantics

Initially, the content of a reveal block is hidden when the HTML generated from a TapirMD document is loaded. Its visibility toggles based on specific user interactions.

If the first non-attribute child block of a reveal block is a first-level header block, the first-level header block is rendered as the always-visible title of the reveal block.

(a reveal block example)

Render Result

Zig
C/C++
Go

TapirMD Source

      ?  {
         *  Zig
         *  C/C++
         *  Go
         }

(another reveal block example, with header)

Render Result

Why TapirMD?

The main purpose of TapirMD is to invent a powerful markup language which is both readable and easily extensible.

I believe TapirMD will boost my technical writing productivity.

TapirMD Source

      ?  ### Why TapirMD?
         {

         The main purpose of TapirMD is to invent a powerful markup language
         which is both readable and easily extensible.

         I believe TapirMD will boost my technical writing productivity.
         }

Plain block semantics

A plain block is a simple container without specific styling. However, if its first non-attribute child block is a first-level header block, then the first-level header block is rendered with a specific header style.

(a plain block example)

Render Result

main.zig

      const std = @import("std");

      pub fn main() void {
          std.debug.print("Zig is fast as lightning.\n", .{});
      }

TapirMD Source

      .  ### main.zig
         ''' zig
      const std = @import("std");

      pub fn main() void {
          std.debug.print("Zig is fast as lightning.\n", .{});
      }
         '''

(another plain block example)

Render Result

A bare plain block is placed between the two lists to avoid them being interpreted as a single, continuous list.

TapirMD Source

      A bare plain block is placed between the two lists to
      avoid them being interpreted as a single, continuous list.

      *. foo

      *. bar

      .  // terminate the above list

      *. 123

      *. xyz

Reserved marks

The following punctuation characters are potential predefined-container-leading marks. They should be escaped when they appear at the beginning of a line.

=
|
@
$
%
^
<
&
_ (underscore)
; 
,

The following punctuation character sequences are potential atom-block-leading marks. They should be escaped when they appear at the beginning of a usual line.

+++
,,,
...
!!!
???
///
\\\
&&&
[[[
]]]
(((
)))

The following punctuation character sequences are potential inline marks. They should be escaped in header and usual blocks.

,,
==
^^
<<
>>
@@
((
))
[[
]]

Footnotes

1. Markdown is known for its limited capabilities and lack of strict specification.
2. Without relying on interactive UI elements, a TapirMD document can easily be converted to multiple formats beyond HTML, including EPUB and others.
3. CJK is a short form standing for Chinese, Japanese, and Korean.