Tree-sitter-tm1: A language grammar for syntax highlighting using tree-sitter

bgregs · Post by **bgregs** » Thu May 09, 2024 6:36 pm

Project site: https://git.sr.ht/~graybeard/tree-sitter-tm1

Status

(In progress) TurboIntegrator reserved words
(Completed) ASCII and Text TurboIntegrator Functions
(In progress) Attribute Manipulation TurboIntegrator Functions
Chore Management TurboIntegrator Functions
Cube Manipulation TurboIntegrator Functions
Data Reservation TurboIntegrator Functions
Date and Time TurboIntegrator Functions
Dimension Manipulation TurboIntegrator Functions
Hierarchy Manipulation TurboIntegrator Functions
ODBC TurboIntegrator Functions
Process Control TurboIntegrator Functions
Rules Management TurboIntegrator Functions
Sandbox Functions
Security TurboIntegrator Functions
Server Manipulation TurboIntegrator Functions
Subset Manipulation TurboIntegrator Functions
View Manipulation TurboIntegrator Functions
Miscellaneous TurboIntegrator Functions

Although I no longer work with TM1, I still find it a great place to try porting some of the latest technologies and capabilities. As some of you may know, the world of syntax highlighting and source code parsing has advanced quite a bit and is slowly (but steadily) moving away from regex-based matching toward syntax trees. I won't go further into the details here, as that could quickly derail the point of this post, but if you're curious, see the additional details at the end.

I've begun work on building a language grammar for TurboIntegrator processes. The project (linked at the top) is a bit slow going due to how TI syntax works, but it's coming along. Unlike other languages, TI doesn't leave much flexibility in how logic is written, so much of the language grammar consists of matching the syntax of built-in functions. This isn't terrible, but it does mean that there's a lot of room to group and condense rules, which is a very iterative process. It also means that this will probably be one of the larger tree-sitter grammars, but I'm hoping that over time things can be grouped more efficiently.

Support is currently limited to some text functions and a few attribute manipulation functions, but I'm adding more pretty frequently. To give an example of what it looks like, let's take a dummy text file (test.pro) with some valid TI code:

test.pro

# This is a comment
asciidelete('myfile.txt');
asciioutput("my_file.txt", 'abc', v1);
asciioutputopen("my_file.txt", FILE_OPEN_APPEND());
numbertostring(12345.123);
numbertostringex(12345.123, "abc", ".", ',');
setinputcharacterset("TM1CS_UTF8");
setoutputcharacterset('file_name.txt', "TM1CS_UTF8");
setoutputescapedoublequote("my_file.txt", 1);
stringtonumber('12345.57');
stringtonumberex('12345.57', ".", '.');
textoutput('my_fil123.txt', "myVar", var1, var2);
### This is another comment
attrnl('Model', "abc", "size", "en");
attrsl('Model', "efg", "size", "en");
attrdelete("dimension", 'name');
attrinsert('Model', "Transmission", 'InteriorColor', 'N');
attrputn(123.45, "Dimension", "ElName", 'AttrName', "ProdCode");
attrputs("blah", "Dimension", "ElName", 'AttrName', "ProdCode");

By itself the code is pretty uninspiring, but after running it through our parser (generated from the language grammar), we get the following syntax tree:

(source_file [0, 0] - [19, 0]
  (comment [0, 0] - [0, 19])
  (builtin_funcs [1, 0] - [1, 26]
    (ascii_and_text_funcs [1, 0] - [1, 25]
      (asciidelete [1, 0] - [1, 25]
        (quoted_string [1, 12] - [1, 24]))))
  (builtin_funcs [2, 0] - [2, 38]
    (ascii_and_text_funcs [2, 0] - [2, 37]
      (ascii_or_text_output [2, 0] - [2, 37]
        (quoted_string [2, 12] - [2, 25])
        (quoted_string [2, 27] - [2, 32])
        (word [2, 34] - [2, 36]))))
  (builtin_funcs [3, 0] - [3, 51]
    (ascii_and_text_funcs [3, 0] - [3, 50]
      (asciioutputopen [3, 0] - [3, 50]
        (quoted_string [3, 16] - [3, 29]))))
  (builtin_funcs [4, 0] - [4, 26]
    (ascii_and_text_funcs [4, 0] - [4, 25]
      (numbertostring [4, 0] - [4, 25]
        (float [4, 15] - [4, 24]))))
  (builtin_funcs [5, 0] - [5, 45]
    (ascii_and_text_funcs [5, 0] - [5, 44]
      (numbertostringex [5, 0] - [5, 44]
        (float [5, 17] - [5, 26])
        (quoted_string [5, 28] - [5, 33])
        (num_separator [5, 35] - [5, 38])
        (num_separator [5, 40] - [5, 43]))))
  (builtin_funcs [6, 0] - [6, 35]
    (ascii_and_text_funcs [6, 0] - [6, 34]
      (setinputcharacterset [6, 0] - [6, 34]
        (char_set [6, 22] - [6, 32]))))
  (builtin_funcs [7, 0] - [7, 53]
    (ascii_and_text_funcs [7, 0] - [7, 52]
      (setoutputcharacterset [7, 0] - [7, 52]
        (quoted_string [7, 22] - [7, 37])
        (char_set [7, 40] - [7, 50]))))
  (builtin_funcs [8, 0] - [8, 45]
    (ascii_and_text_funcs [8, 0] - [8, 44]
      (setoutputescapedoublequote [8, 0] - [8, 44]
        (quoted_string [8, 27] - [8, 40]))))
  (builtin_funcs [9, 0] - [9, 27]
    (ascii_and_text_funcs [9, 0] - [9, 26]
      (stringtonumber [9, 0] - [9, 26]
        (quoted_string [9, 15] - [9, 25]))))
  (builtin_funcs [10, 0] - [10, 39]
    (ascii_and_text_funcs [10, 0] - [10, 38]
      (stringtonumberex [10, 0] - [10, 38]
        (quoted_string [10, 17] - [10, 27])
        (num_separator [10, 29] - [10, 32])
        (num_separator [10, 34] - [10, 37]))))
  (builtin_funcs [11, 0] - [11, 49]
    (ascii_and_text_funcs [11, 0] - [11, 48]
      (ascii_or_text_output [11, 0] - [11, 48]
        (quoted_string [11, 11] - [11, 26])
        (quoted_string [11, 28] - [11, 35])
        (word [11, 37] - [11, 41])
        (word [11, 43] - [11, 47]))))
  (comment [12, 0] - [12, 27])
  (builtin_funcs [13, 0] - [13, 37]
    (attr_funcs [13, 0] - [13, 36]
      (attrnl_and_attrsl [13, 0] - [13, 36]
        (quoted_string [13, 7] - [13, 14])
        (quoted_string [13, 16] - [13, 21])
        (quoted_string [13, 23] - [13, 29])
        (quoted_string [13, 31] - [13, 35]))))
  (builtin_funcs [14, 0] - [14, 37]
    (attr_funcs [14, 0] - [14, 36]
      (attrnl_and_attrsl [14, 0] - [14, 36]
        (quoted_string [14, 7] - [14, 14])
        (quoted_string [14, 16] - [14, 21])
        (quoted_string [14, 23] - [14, 29])
        (quoted_string [14, 31] - [14, 35]))))
  (builtin_funcs [15, 0] - [15, 32]
    (attr_funcs [15, 0] - [15, 31]
      (attrdelete [15, 0] - [15, 31]
        (quoted_string [15, 11] - [15, 22])
        (quoted_string [15, 24] - [15, 30]))))
  (builtin_funcs [16, 0] - [16, 58]
    (attr_funcs [16, 0] - [16, 57]
      (attrinsert [16, 0] - [16, 57]
        (quoted_string [16, 11] - [16, 18])
        (quoted_string [16, 20] - [16, 34])
        (quoted_string [16, 36] - [16, 51])
        (attr_type [16, 54] - [16, 55]))))
  (builtin_funcs [17, 0] - [17, 64]
    (attr_funcs [17, 0] - [17, 63]
      (attrputn_and_attrputs [17, 0] - [17, 63]
        (float [17, 9] - [17, 15])
        (quoted_string [17, 17] - [17, 28])
        (quoted_string [17, 30] - [17, 38])
        (quoted_string [17, 40] - [17, 50])
        (quoted_string [17, 52] - [17, 62]))))
  (builtin_funcs [18, 0] - [18, 64]
    (attr_funcs [18, 0] - [18, 63]
      (attrputn_and_attrputs [18, 0] - [18, 63]
        (quoted_string [18, 9] - [18, 15])
        (quoted_string [18, 17] - [18, 28])
        (quoted_string [18, 30] - [18, 38])
        (quoted_string [18, 40] - [18, 50])
        (quoted_string [18, 52] - [18, 62])))))

This is extremely powerful because we now have a tokenized form of our TI source file. The possibilities come down to the imagination of the developer and what they want to integrate into different editors (most editors have support for cool features like syntax-based error reporting, code folding, etc.).
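As a rough illustration of using that output outside an editor, here is a minimal Python sketch that reads the printed tree back into a flat list of nodes. It assumes the text format shown above (a node type followed by zero-based [row, column] ranges) and uses a short excerpt of the dump as input. The names NODE_RE, Node, and read_nodes are made up for this example and are not part of tree-sitter or this project.

```python
import re
from typing import NamedTuple

# Matches one node header as printed in the dump above, e.g.
#   (quoted_string [1, 12] - [1, 24]
NODE_RE = re.compile(r"\((\w+) \[(\d+), (\d+)\] - \[(\d+), (\d+)\]")


class Node(NamedTuple):
    type: str
    start: tuple  # (row, column), zero-based
    end: tuple    # (row, column), zero-based


def read_nodes(dump: str) -> list:
    """Return every node found in the dump, in document order."""
    return [
        Node(m[1], (int(m[2]), int(m[3])), (int(m[4]), int(m[5])))
        for m in NODE_RE.finditer(dump)
    ]


# Excerpt of the tree printed for test.pro (first statement only).
SAMPLE = """\
(source_file [0, 0] - [19, 0]
  (comment [0, 0] - [0, 19])
  (builtin_funcs [1, 0] - [1, 26]
    (ascii_and_text_funcs [1, 0] - [1, 25]
      (asciidelete [1, 0] - [1, 25]
        (quoted_string [1, 12] - [1, 24]))))
"""

nodes = read_nodes(SAMPLE)
strings = [n for n in nodes if n.type == "quoted_string"]
for n in strings:
    # Convert to the one-based positions most editors display.
    print(f"{n.type} at line {n.start[0] + 1}, column {n.start[1] + 1}")
```

This flattens the tree and ignores nesting, which is enough for simple tasks such as listing every string literal in a process. An editor integration would normally walk the tree through the tree-sitter library rather than parse its printed form.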

Running our source file through the command line syntax highlighting utility produces the following (pasted as an image):

[Image: syntax.jpeg, the highlighted test.pro output]

Just to reiterate the flexibility of having a syntax tree, I produced a highlighted version of my code independent of an editor, in my terminal.

Tree-sitter can also report errors in syntax. Let's assume we forgot a semicolon on the last statement of test.pro. The syntax tree would display the following:

...
  (builtin_funcs [18, 0] - [18, 63]
    (attr_funcs [18, 0] - [18, 63]
      (attrputn_and_attrputs [18, 0] - [18, 63]
        (quoted_string [18, 9] - [18, 15])
        (quoted_string [18, 17] - [18, 28])
        (quoted_string [18, 30] - [18, 38])
        (quoted_string [18, 40] - [18, 50])
        (quoted_string [18, 52] - [18, 62])))))
test.pro	   0.12 ms	  6474 bytes/ms	(MISSING ";" [18, 63] - [18, 63])

Notice how it points to the exact line and column of the error (both zero-based) and even attempts to recommend a solution (in this case, it reports MISSING ";" rather than a generic ERROR)? Pretty cool!
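To show how that report could feed a simple lint step, here is a small Python sketch that scans the parser output for problems and prints them with one-based positions. It relies on the MISSING format shown above. The handling of ERROR nodes assumes they print like ordinary nodes, which is my assumption rather than something shown in this post, and find_problems is a made-up helper name.

```python
import re

# MISSING nodes as printed above: (MISSING ";" [18, 63] - [18, 63])
MISSING_RE = re.compile(r'\(MISSING "([^"]*)" \[(\d+), (\d+)\]')
# Assumption: ERROR nodes print like ordinary nodes, e.g. (ERROR [3, 0] - [3, 5]
ERROR_RE = re.compile(r"\(ERROR \[(\d+), (\d+)\]")


def find_problems(dump: str) -> list:
    """Return (line, column, message) tuples with one-based positions."""
    problems = []
    for m in MISSING_RE.finditer(dump):
        row, col = int(m[2]), int(m[3])
        problems.append((row + 1, col + 1, f'missing "{m[1]}"'))
    for m in ERROR_RE.finditer(dump):
        row, col = int(m[1]), int(m[2])
        problems.append((row + 1, col + 1, "syntax error"))
    return sorted(problems)


# The summary line printed for test.pro with the last semicolon removed.
SAMPLE = 'test.pro\t   0.12 ms\t  6474 bytes/ms\t(MISSING ";" [18, 63] - [18, 63])'

for line, col, message in find_problems(SAMPLE):
    print(f"test.pro:{line}:{col}: {message}")
```

The sketch only covers the quoted-token form of MISSING seen in this example. Other forms would need their own patterns.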

That pretty much sums it up! In theory this could enable a lot more than just syntax highlighting, both on the web and in the various applications that support tree-sitter. Each editor has its own way of loading a language grammar (I personally use Emacs for, well, everything), so you'll need to consult your editor's documentation on a case-by-case basis.

If you want to contribute, please send patches using the git-send-email workflow to dev@mailcd.com. One last note: this is licensed under the 3-clause BSD license, so feel free to use it, fork it, extend it, or sell it.

---

Additional Details:

At a high level, the tree-sitter process works like this:

A parser generator is provided a language grammar
A language parser is created based on the language grammar
The parser is used to build a concrete syntax tree (CST) of a source file
The CST is paired with highlighting rules that map syntax nodes to styles (colors, etc.)
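The last step above can be sketched in a few lines of Python. This is a simplified stand-in, not how tree-sitter does it internally: it takes node types and ranges like those in the tree printed earlier and wraps each matching span in an ANSI color code. The STYLES mapping and the highlight function are hypothetical names chosen for this example, and the sketch assumes every styled node sits on a single line.

```python
# Hypothetical mapping from node types to ANSI color codes.
STYLES = {"comment": "90", "quoted_string": "32", "float": "36"}


def highlight(source: str, spans: list) -> str:
    """Color single-line spans given as (node_type, (row, col), (row, col))."""
    lines = source.split("\n")
    # Work from the last span to the first so that inserting escape codes
    # never shifts the columns of spans that are still to be processed.
    for node_type, (row, start), (_, end) in sorted(
        spans, key=lambda s: s[1], reverse=True
    ):
        code = STYLES.get(node_type)
        if code is None:
            continue  # no style defined for this node type
        line = lines[row]
        lines[row] = (
            f"{line[:start]}\x1b[{code}m{line[start:end]}\x1b[0m{line[end:]}"
        )
    return "\n".join(lines)


# First two lines of test.pro with the matching nodes from the tree.
SOURCE = "# This is a comment\nasciidelete('myfile.txt');"
SPANS = [
    ("comment", (0, 0), (0, 19)),
    ("quoted_string", (1, 12), (1, 24)),
]

print(highlight(SOURCE, SPANS))
```

In practice the mapping between nodes and styles lives in the grammar's highlighting rules rather than in a hard-coded dictionary, but the idea of pairing a node's range with a style is the same.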

The syntax tree is updated incrementally as the file is edited, which is extremely efficient, and performance is typically far better than traditional regex-based matching.

There are multiple parser generator projects, but tree-sitter seems to be gaining wide adoption and is being integrated into many larger projects. Tree-sitter can also be used both in applications and in the browser, making it extremely versatile. Tree-sitter by itself just produces the CST (and may provide highlighting groups), but how the CST is used is completely open-ended. Although commonly used for syntax highlighting, it can also be used for things such as code folding, syntax-based error reporting, and other useful features.

Here's a list of a few places where tree-sitter is currently being used (either directly or through a plugin):

Emacs (as of version 29)
Atom
VSCode
GitHub (used in the browser)
Probably more

When looking for an overview of tree-sitter to link to this post, I came across this random person discovering the usefulness of tree-sitter in neovim. A bit of an odd video, but it suits my needs: https://www.youtube.com/watch?v=CPB6e5AGk6s

If you're interested in finding out more, I'd recommend starting with the tree-sitter documentation: https://tree-sitter.github.io/tree-sitter/

Wim Gielis · Post by **Wim Gielis** » Fri May 10, 2024 7:45 am

Wow! Even understanding only half of the above, I already find it fascinating.

I would be very much interested in seeing how this evolves and how it opens up new possibilities.
