Syntax of code highlighting scripts

Table of Contents

About this document
The general idea
Script syntax
Comments
States
Tokens
Delimiters
Rules
How does the parser work?
State memory and nested code
Regular expression support
Examples included with the software

About this document

This document explains how the highlighting parser works and describes the syntax of parser scripts. It is not intended to enable you to create new code highlighting scripts, but it should help you modify and update the existing ones.


The general idea

The highlighting parser is an algorithm that walks through the text and marks each part of it as a particular type of code token. Each token type is highlighted with a separate color in the code editor. The parser works according to the rules contained in the highlighting script.

In this document the theory will be explained based on a simple example of a very basic HTML highlighting script that paints everything between < and > with a different color.

The parser works by going through the text and searching for the next change of state that can occur in the current state, marking pieces of text (tokens) along the way.

In our very basic example, the parser would go through the text searching for the < symbol. Once it is found, the parser would enter the "HTML tag" state, start marking everything as an HTML tag, and search for > to leave that state. Once it has found >, the parser would return to the "text" state and search for the next occurrence of <.
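
The walk described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the editor's actual code: the function name mark_tags and the state labels "text" and "HTML tag" are made up for this example.

```python
def mark_tags(text):
    """Split text into (label, chunk) pairs using two states."""
    pieces = []
    state = "text"   # the walk starts in the "text" state
    start = 0        # index where the current chunk began
    for i, ch in enumerate(text):
        if state == "text" and ch == "<":
            if i > start:
                pieces.append(("text", text[start:i]))
            start = i
            state = "HTML tag"      # entering a tag
        elif state == "HTML tag" and ch == ">":
            pieces.append(("HTML tag", text[start:i + 1]))
            start = i + 1
            state = "text"          # leaving the tag
    if start < len(text):
        # Whatever is left keeps the label of the state we ended in.
        pieces.append((state, text[start:]))
    return pieces


for label, chunk in mark_tags("Hi <b>there</b>!"):
    print(label, repr(chunk))
```

Each chunk labelled "HTML tag" would be painted with the tag color, and everything else would keep the plain text color.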


Script syntax

The general syntax of a highlighting script is as follows:

// States
State=statename1
State=statename2
...
State=statenameN

// Tokens
Token=tokenname1
Token=tokenname2  {Pretty Name}
...
Token=tokennameN

// Delimiters
Delimiters=[all delimiter characters in one string]

// Rules
Rule1
Rule2
...
RuleN
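
To make the layout concrete, here is a rough Python sketch that reads a script in this format into a dictionary. The function load_script and the dictionary keys are hypothetical, and details such as ignoring blank lines are assumptions; the editor's real loader may behave differently.

```python
EXAMPLE = r"""
// States
State=snormal
State=shtmltag

// Tokens
Token=tnone
Token=thtmltag  {HTML Tag}

// Delimiters
Delimiters=;.,:'"{}[]()<>?!@#%^&*-+=|\/

// Rules
snormal    <       shtmltag    thtmltag
shtmltag   [^>]*   shtmltag    thtmltag
shtmltag   >       snormal     thtmltag
"""


def load_script(source):
    """Sort the lines of a highlighting script into its four sections."""
    script = {"states": [], "tokens": {}, "delimiters": "", "rules": []}
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # blank line or comment
        if line.startswith("State="):
            script["states"].append(line[len("State="):].strip())
        elif line.startswith("Token="):
            name, _, pretty = line[len("Token="):].partition("{")
            name = name.strip()
            # Fall back to the internal name when no pretty name is given.
            script["tokens"][name] = pretty.rstrip("}").strip() or name
        elif line.startswith("Delimiters="):
            script["delimiters"] = line[len("Delimiters="):]
        else:
            # Anything else is treated as a rule: whitespace-separated fields.
            script["rules"].append(tuple(line.split()))
    return script


script = load_script(EXAMPLE)
print(script["states"])
print(script["tokens"])
print(script["rules"])
```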


Comments

Comments start with // and should be placed at the beginning of a line.


States

State definitions list all the states the parser can be in. The rules refer to these states during parsing.

In the example of the simplified HTML code parser, there would be two parser states, "inside text" and "inside HTML tag". They could be defined as follows:

State=snormal
State=shtmltag


Tokens

Token definitions list all the types of code the parser can mark. Each token can be highlighted with a separate color in the code editor.

In the example of the simplified HTML code parser, there would be two tokens, tnone (for plain text) and thtmltag. They could be defined as follows:

Token=tnone
Token=thtmltag  {HTML Tag}

Optionally, you can give a token a more readable name that will be presented to the user in place of its internal name. To do this, enclose the "pretty" name in curly braces after the token definition, as in the example above.


Delimiters

Delimiters are the characters that separate whole words, so they determine how whole words are selected. Usually they are defined like this:

Delimiters=;.,:'"{}[]()<>?!@#%^&*-+=|\/
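
As a rough illustration of what "breaking text into whole words" could mean, the Python sketch below splits a line at every delimiter character. The helper split_words is hypothetical, and treating whitespace as a word break is an assumption, since the delimiter string itself contains no space.

```python
# The delimiter string from the example above, written as a Python literal.
DELIMITERS = ";.,:'\"{}[]()<>?!@#%^&*-+=|\\/"


def split_words(text, delimiters=DELIMITERS):
    """Return the whole words in text, breaking at delimiters and whitespace."""
    words, current = [], []
    for ch in text:
        if ch in delimiters or ch.isspace():
            if current:                      # a word just ended
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                              # word running to the end of text
        words.append("".join(current))
    return words


print(split_words('<a href="index.html">Home</a>'))
```

With this delimiter set, index.html is seen as two words because the period is a delimiter.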


Rules

Rules are the main part of a highlighting script. They determine how the parser works. Each rule has a very simple format, yet a set of rules can express a very complex parsing algorithm.

Each rule has the following format:

currentstate   state_change_expr   newstate   tokenname

currentstate
The state the parser must currently be in for the rule to apply.

state_change_expr
The word, character or regular expression that triggers the state change.

newstate
The state the parser switches to.

tokenname
The token with which the text matched by state_change_expr is marked.

In the example of the simplified HTML code parser, there would be three rules:

snormal    <       shtmltag    thtmltag
shtmltag   [^>]*   shtmltag    thtmltag
shtmltag   >       snormal     thtmltag


How does the parser work?

The algorithm is pretty simple:

  1. The parser always starts at the very beginning of the code in the default state snormal, which is a mandatory state.
  2. It looks at all the rules that start with the current state (snormal at the beginning) and, if any rule matches, switches the state accordingly.
  3. After the state has changed, the parser repeats the previous step, and so on until the end of the text.

Let's look at our very simple example:

State=snormal
State=shtmltag

Token=tnone
Token=thtmltag  {HTML Tag}

Delimiters=;.,:'"{}[]()<>?!@#%^&*-+=|\/

snormal    <       shtmltag    thtmltag
shtmltag   [^>]*   shtmltag    thtmltag
shtmltag   >       snormal     thtmltag

In this example, the parser would start with a current state of snormal. It would try to match the only rule for state snormal:

snormal   <   shtmltag   thtmltag

When a < symbol is found:

  1. The symbol is marked as an HTML tag (token thtmltag is used)
  2. The current state is changed to shtmltag

After this, two rules are tested:

shtmltag   [^>]*   shtmltag    thtmltag
shtmltag   >       snormal     thtmltag

The first rule matches all symbols within the tag that are not the tag closing symbol (>), marks them as an HTML tag (token thtmltag is used) and leaves the state as shtmltag, because the parser is still inside the HTML tag.

The second rule matches the > symbol, marks it as a token of type thtmltag and changes the state to snormal, as the parser is leaving the HTML tag.
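
The walkthrough above can be simulated with a small rule-driven loop in Python. This is a sketch under several assumptions, not the editor's implementation: Python's re module stands in for the script's own regular expression dialect, text that no rule matches is marked with tnone, empty matches are ignored, and adjacent chunks with the same token are merged for readability. The function highlight is a made-up name.

```python
import re

RULES = [
    # (current state, state_change_expr, new state, token name)
    ("snormal",  r"<",     "shtmltag", "thtmltag"),
    ("shtmltag", r"[^>]*", "shtmltag", "thtmltag"),
    ("shtmltag", r">",     "snormal",  "thtmltag"),
]


def highlight(text, rules=RULES, default_token="tnone"):
    """Return (token, chunk) pairs for text, starting in state snormal."""
    state, pos, marked = "snormal", 0, []

    def mark(token, chunk):
        # Merge with the previous chunk when the token is the same.
        if marked and marked[-1][0] == token:
            marked[-1] = (token, marked[-1][1] + chunk)
        else:
            marked.append((token, chunk))

    while pos < len(text):
        # Rules are tried from the last one to the first.
        for current, expr, new_state, token in reversed(rules):
            if current != state:
                continue
            match = re.compile(expr).match(text, pos)
            if match and match.end() > pos:      # skip empty matches
                mark(token, match.group())
                state, pos = new_state, match.end()
                break
        else:
            # Assumption: unmatched text gets the default token.
            mark(default_token, text[pos])
            pos += 1
    return marked


for token, chunk in highlight("Hi <b>there</b>!"):
    print(token, repr(chunk))
```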

The order of rule processing

If multiple rules apply in a particular situation, they are processed in reverse order, from bottom to top. The last rule is therefore processed first and overrides any rules before it.
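
A small Python sketch may help to show why the order matters. It assumes that "overrides" means the bottom-most applicable rule wins; pick_rule and the token names tword and tresword are hypothetical, and Python's re module is used only as a stand-in for the script's own expression matching.

```python
import re


def pick_rule(state, text, pos, rules):
    """Return the first applicable rule, testing the last rule first."""
    for rule in reversed(rules):
        current, expr, _new_state, _token = rule
        if current == state and re.compile(expr).match(text, pos):
            return rule
    return None


# Two overlapping rules: both can match the word "if".
general_first = [
    ("snormal", r"[a-z]+", "snormal", "tword"),
    ("snormal", r"if",     "snormal", "tresword"),
]
specific_first = list(reversed(general_first))

# The rule written last is tested first, so it wins.
print(pick_rule("snormal", "if x", 0, general_first)[3])    # tresword
print(pick_rule("snormal", "if x", 0, specific_first)[3])   # tword
```

In other words, under this reading more specific rules belong below the more general ones they are meant to override.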

Rules should be exclusive

Whenever possible, rules should be designed so that no two rules can match the same piece of text at the same time. Otherwise there is a lot of room for conflicts.


State memory and nested code

States can be nested: the parser can save and remember one previous state, but no more. This is done by adding the keyword SaveState after the rule. For example:

snormal   <?   shtmlPHP    tphptag    SaveState

In the above example, the parser would remember that the PHP tag started inside snormal. This is useful because the PHP tag could also have started within numerous other states, and since the parser only knows the current state, it would be impossible to get back to snormal without knowing that the PHP tag started there.

To load the remembered state, add the keyword LoadState after the rule. For example:

shtmlPHP   ?>   snormal    tphptag    LoadState

In the above example, the parser would return to the remembered state regardless of the new state specified in the rule. At this point the parser itself does not know where the PHP tag started, so the state memory comes in handy.

Please remember that the state memory can remember only one state.
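
The single-slot memory can be pictured with the following Python sketch. It is an illustration only: apply_rule is a hypothetical helper, the rule that opens a PHP tag inside shtmltag is an invented variant of the SaveState example, and falling back to the rule's own new state when nothing has been saved is an assumption.

```python
def apply_rule(state, saved, rule):
    """Return (new_state, saved) after the given rule has fired."""
    _current, _expr, new_state, _token, *keywords = rule
    if "SaveState" in keywords:
        saved = state                 # one slot: overwrites any earlier value
    if "LoadState" in keywords and saved is not None:
        new_state = saved             # ignore the new state named in the rule
    return new_state, saved


# Hypothetical variant: the PHP tag opens while inside an HTML tag.
open_in_tag = ("shtmltag", "<?", "shtmlPHP", "tphptag", "SaveState")
close_php = ("shtmlPHP", "?>", "snormal", "tphptag", "LoadState")

state, saved = "shtmltag", None
state, saved = apply_rule(state, saved, open_in_tag)
print(state, saved)    # now inside PHP, remembering shtmltag
state, saved = apply_rule(state, saved, close_php)
print(state)           # back in shtmltag, not snormal
```

Because there is only one slot, a second SaveState before the matching LoadState would replace the remembered state rather than stack on top of it.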


Regular expression support

Rules can match a symbol, a word or a regular expression. See the examples below.

snormal    <          shtmltag    thtmltag
snormal    '&copy;'   shtmltag    tresword
shtmltag   [^>]*      shtmltag    thtmltag

The supported regular expressions are a greatly reduced version of Perl regular expressions, so only the basics of Perl regular expressions work.

The following wildcard characters are recognized in regular expressions:

^ A circumflex at the start of the string matches the start of a line.
$ A dollar sign at the end of the expression matches the end of a line.
. A period matches any character.
* An asterisk after a string matches any number of occurrences of that string followed by any characters, including zero characters. For example, bo* matches bot, bo and boo but not b.
+ A plus sign after a string matches any number of occurrences of that string followed by any characters except zero characters. For example, bo+ matches boo and booo, but not bo or be.
[ ] Characters in brackets match any one character that appears in the brackets, but no others. For example [bot] matches b, o, or t.
[^] A circumflex at the start of the string in brackets means NOT. Hence, [^bot] matches any characters except b, o, or t.
[-] A hyphen within the brackets signifies a range of characters. For example, [b-o] matches any character from b through o.
\ A backslash before a wildcard character tells the code editor to treat that character literally, not as a wildcard. For example, \^ matches ^ and does not look for the start of a line.
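
Some of these wildcards can be tried out quickly in Python. Note that Python's re module is a different and much richer engine than the one used by the highlighting parser, so this is only a way to experiment with the bracket, period, anchor and backslash entries. The * and + entries are left out because their wording above may not correspond exactly to how Python treats those characters.

```python
import re

# (pattern, sample text, whether the whole sample is expected to match)
checks = [
    (r"[bot]",  "o", True),    # one of b, o, t
    (r"[bot]",  "x", False),
    (r"[^bot]", "x", True),    # anything except b, o, t
    (r"[^bot]", "t", False),
    (r"[b-o]",  "k", True),    # range b through o
    (r"[b-o]",  "p", False),
    (r"\^",     "^", True),    # escaped circumflex is a literal ^
    (r".",      "z", True),    # period matches any character
]

results = [bool(re.fullmatch(pattern, sample)) for pattern, sample, _ in checks]
for (pattern, sample, expected), got in zip(checks, results):
    print(pattern, repr(sample), got, "(expected %s)" % expected)

# Anchors: ^ ties the match to the start of a line, $ to the end.
starts_with_tag = bool(re.search(r"^<", "<b>"))
not_at_start = bool(re.search(r"^<", "a<b>"))
ends_with_tag = bool(re.search(r">$", "<b>"))
print(starts_with_tag, not_at_start, ends_with_tag)
```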


Examples included with the software

You can find some interesting example scripts in the \data\hscripts sub-folder under the folder where you have installed the application.
