Tree-sitter-tm1: A language grammar for syntax highlighting using tree-sitter


Post by bgregs »

Project site: https://git.sr.ht/~graybeard/tree-sitter-tm1

Status
  • (In progress) TurboIntegrator reserved words
  • (Completed) ASCII and Text TurboIntegrator Functions
  • (In progress) Attribute Manipulation TurboIntegrator Functions
  • Chore Management TurboIntegrator Functions
  • Cube Manipulation TurboIntegrator Functions
  • Data Reservation TurboIntegrator Functions
  • Date and Time TurboIntegrator Functions
  • Dimension Manipulation TurboIntegrator Functions
  • Hierarchy Manipulation TurboIntegrator Functions
  • ODBC TurboIntegrator Functions
  • Process Control TurboIntegrator Functions
  • Rules Management TurboIntegrator Functions
  • Sandbox Functions
  • Security TurboIntegrator Functions
  • Server Manipulation TurboIntegrator Functions
  • Subset Manipulation TurboIntegrator Functions
  • View Manipulation TurboIntegrator Functions
  • Miscellaneous TurboIntegrator Functions
Although I no longer work with TM1, I still find that it's a great place to test out porting some of the latest technologies and capabilities. As some of you may be aware, the world of syntax highlighting and source code parsing has advanced quite a bit and is slowly (but steadily) moving away from regex-based matching toward syntax trees. I won't go into the details here, as it could quickly derail the point of this post, but if you're curious, see the Additional Details section at the end of the post.

I've begun work on building a language grammar for TurboIntegrator (TI) processes. The project (linked at the top) is a bit slow going due to how TI syntax works, but it's coming along. Unlike most languages, TI doesn't offer much flexibility in its logic, meaning that much of the language grammar consists of matching the syntax of built-in functions. This isn't terrible, but it does mean that there's a lot of room to group and condense rules, which is a very iterative process. It also means that this will probably be one of the larger tree-sitter grammars, but I'm hoping that over time things can be grouped more efficiently.
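To make the "group and condense" idea a bit more concrete, here is a minimal Python sketch. It is not the project's grammar (tree-sitter grammars are written in a grammar.js file), and the argument shapes in SIGNATURES are simplified assumptions chosen for illustration. It only shows how built-ins that share an argument shape could be folded into one combined rule:

```python
from collections import defaultdict

# Hypothetical, simplified argument shapes for a few TI built-ins.
# These are illustrative assumptions, not the grammar's real definitions.
SIGNATURES = {
    "attrnl": ("string", "string", "string", "string"),
    "attrsl": ("string", "string", "string", "string"),
    "attrputn": ("value", "string", "string", "string", "string"),
    "attrputs": ("value", "string", "string", "string", "string"),
    "attrdelete": ("string", "string"),
}


def group_by_shape(signatures):
    """Group function names that share an identical argument shape."""
    groups = defaultdict(list)
    for name, shape in signatures.items():
        groups[shape].append(name)
    # Sort the names so the result is deterministic.
    return {shape: sorted(names) for shape, names in groups.items()}


def rule_name(names):
    """Build a combined rule name such as 'attrnl_and_attrsl'."""
    return "_and_".join(names)


if __name__ == "__main__":
    for shape, names in group_by_shape(SIGNATURES).items():
        print(rule_name(names), "->", len(shape), "arguments")
```

Five functions collapse into three rules here, which is the kind of saving that adds up across a few hundred built-ins.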

Support is currently limited to some text functions and a few attribute manipulation functions, but I'm adding more pretty frequently. To give an example of what it looks like, let's take a dummy text file (test.pro) with some valid TI code:

test.pro

Code:

# This is a comment
asciidelete('myfile.txt');
asciioutput("my_file.txt", 'abc', v1);
asciioutputopen("my_file.txt", FILE_OPEN_APPEND());
numbertostring(12345.123);
numbertostringex(12345.123, "abc", ".", ',');
setinputcharacterset("TM1CS_UTF8");
setoutputcharacterset('file_name.txt', "TM1CS_UTF8");
setoutputescapedoublequote("my_file.txt", 1);
stringtonumber('12345.57');
stringtonumberex('12345.57', ".", '.');
textoutput('my_fil123.txt', "myVar", var1, var2);
### This is another comment
attrnl('Model', "abc", "size", "en");
attrsl('Model', "efg", "size", "en");
attrdelete("dimension", 'name');
attrinsert('Model', "Transmission", 'InteriorColor', 'N');
attrputn(123.45, "Dimension", "ElName", 'AttrName', "ProdCode");
attrputs("blah", "Dimension", "ElName", 'AttrName', "ProdCode");
By itself the code is pretty uninspiring, but after running it through our parser (generated from the language grammar) we get the following syntax tree:

Code:

(source_file [0, 0] - [19, 0]
  (comment [0, 0] - [0, 19])
  (builtin_funcs [1, 0] - [1, 26]
    (ascii_and_text_funcs [1, 0] - [1, 25]
      (asciidelete [1, 0] - [1, 25]
        (quoted_string [1, 12] - [1, 24]))))
  (builtin_funcs [2, 0] - [2, 38]
    (ascii_and_text_funcs [2, 0] - [2, 37]
      (ascii_or_text_output [2, 0] - [2, 37]
        (quoted_string [2, 12] - [2, 25])
        (quoted_string [2, 27] - [2, 32])
        (word [2, 34] - [2, 36]))))
  (builtin_funcs [3, 0] - [3, 51]
    (ascii_and_text_funcs [3, 0] - [3, 50]
      (asciioutputopen [3, 0] - [3, 50]
        (quoted_string [3, 16] - [3, 29]))))
  (builtin_funcs [4, 0] - [4, 26]
    (ascii_and_text_funcs [4, 0] - [4, 25]
      (numbertostring [4, 0] - [4, 25]
        (float [4, 15] - [4, 24]))))
  (builtin_funcs [5, 0] - [5, 45]
    (ascii_and_text_funcs [5, 0] - [5, 44]
      (numbertostringex [5, 0] - [5, 44]
        (float [5, 17] - [5, 26])
        (quoted_string [5, 28] - [5, 33])
        (num_separator [5, 35] - [5, 38])
        (num_separator [5, 40] - [5, 43]))))
  (builtin_funcs [6, 0] - [6, 35]
    (ascii_and_text_funcs [6, 0] - [6, 34]
      (setinputcharacterset [6, 0] - [6, 34]
        (char_set [6, 22] - [6, 32]))))
  (builtin_funcs [7, 0] - [7, 53]
    (ascii_and_text_funcs [7, 0] - [7, 52]
      (setoutputcharacterset [7, 0] - [7, 52]
        (quoted_string [7, 22] - [7, 37])
        (char_set [7, 40] - [7, 50]))))
  (builtin_funcs [8, 0] - [8, 45]
    (ascii_and_text_funcs [8, 0] - [8, 44]
      (setoutputescapedoublequote [8, 0] - [8, 44]
        (quoted_string [8, 27] - [8, 40]))))
  (builtin_funcs [9, 0] - [9, 27]
    (ascii_and_text_funcs [9, 0] - [9, 26]
      (stringtonumber [9, 0] - [9, 26]
        (quoted_string [9, 15] - [9, 25]))))
  (builtin_funcs [10, 0] - [10, 39]
    (ascii_and_text_funcs [10, 0] - [10, 38]
      (stringtonumberex [10, 0] - [10, 38]
        (quoted_string [10, 17] - [10, 27])
        (num_separator [10, 29] - [10, 32])
        (num_separator [10, 34] - [10, 37]))))
  (builtin_funcs [11, 0] - [11, 49]
    (ascii_and_text_funcs [11, 0] - [11, 48]
      (ascii_or_text_output [11, 0] - [11, 48]
        (quoted_string [11, 11] - [11, 26])
        (quoted_string [11, 28] - [11, 35])
        (word [11, 37] - [11, 41])
        (word [11, 43] - [11, 47]))))
  (comment [12, 0] - [12, 27])
  (builtin_funcs [13, 0] - [13, 37]
    (attr_funcs [13, 0] - [13, 36]
      (attrnl_and_attrsl [13, 0] - [13, 36]
        (quoted_string [13, 7] - [13, 14])
        (quoted_string [13, 16] - [13, 21])
        (quoted_string [13, 23] - [13, 29])
        (quoted_string [13, 31] - [13, 35]))))
  (builtin_funcs [14, 0] - [14, 37]
    (attr_funcs [14, 0] - [14, 36]
      (attrnl_and_attrsl [14, 0] - [14, 36]
        (quoted_string [14, 7] - [14, 14])
        (quoted_string [14, 16] - [14, 21])
        (quoted_string [14, 23] - [14, 29])
        (quoted_string [14, 31] - [14, 35]))))
  (builtin_funcs [15, 0] - [15, 32]
    (attr_funcs [15, 0] - [15, 31]
      (attrdelete [15, 0] - [15, 31]
        (quoted_string [15, 11] - [15, 22])
        (quoted_string [15, 24] - [15, 30]))))
  (builtin_funcs [16, 0] - [16, 58]
    (attr_funcs [16, 0] - [16, 57]
      (attrinsert [16, 0] - [16, 57]
        (quoted_string [16, 11] - [16, 18])
        (quoted_string [16, 20] - [16, 34])
        (quoted_string [16, 36] - [16, 51])
        (attr_type [16, 54] - [16, 55]))))
  (builtin_funcs [17, 0] - [17, 64]
    (attr_funcs [17, 0] - [17, 63]
      (attrputn_and_attrputs [17, 0] - [17, 63]
        (float [17, 9] - [17, 15])
        (quoted_string [17, 17] - [17, 28])
        (quoted_string [17, 30] - [17, 38])
        (quoted_string [17, 40] - [17, 50])
        (quoted_string [17, 52] - [17, 62]))))
  (builtin_funcs [18, 0] - [18, 64]
    (attr_funcs [18, 0] - [18, 63]
      (attrputn_and_attrputs [18, 0] - [18, 63]
        (quoted_string [18, 9] - [18, 15])
        (quoted_string [18, 17] - [18, 28])
        (quoted_string [18, 30] - [18, 38])
        (quoted_string [18, 40] - [18, 50])
        (quoted_string [18, 52] - [18, 62])))))
This is extremely powerful because we now have a structured, machine-readable form of our TI source file, with every node tagged by its type and position. What you do with it comes down to the imagination of the developer and what they want to integrate into different editors (most editors have support for cool features like syntax-based error reporting, code folding, etc.).
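As a rough illustration of working with that tree outside of an editor, the sketch below reads the text dump format shown above into nested Python objects and walks it. It is a minimal sketch rather than tree-sitter's own API: the Node class and function names are made up for this example, and it assumes a well-formed dump with no MISSING nodes.

```python
import re
from dataclasses import dataclass, field

# Matches one node header such as "(asciidelete [1, 0] - [1, 25]".
NODE_RE = re.compile(r"\((\w+) \[(\d+), (\d+)\] - \[(\d+), (\d+)\]")


@dataclass
class Node:
    kind: str
    start: tuple
    end: tuple
    children: list = field(default_factory=list)


def parse_tree(text):
    """Parse the S-expression dump into nested Node objects."""
    stack, root, pos = [], None, 0
    while pos < len(text):
        match = NODE_RE.match(text, pos)
        if match:
            kind, r1, c1, r2, c2 = match.groups()
            node = Node(kind, (int(r1), int(c1)), (int(r2), int(c2)))
            if stack:
                stack[-1].children.append(node)
            else:
                root = node
            stack.append(node)
            pos = match.end()
        else:
            # A closing parenthesis ends the node on top of the stack.
            if text[pos] == ")" and stack:
                stack.pop()
            pos += 1
    return root


def walk(node, depth=0):
    """Yield (depth, node) pairs in document order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


# A shortened version of the tree from the post.
SAMPLE = """
(source_file [0, 0] - [2, 0]
  (comment [0, 0] - [0, 19])
  (builtin_funcs [1, 0] - [1, 26]
    (ascii_and_text_funcs [1, 0] - [1, 25]
      (asciidelete [1, 0] - [1, 25]
        (quoted_string [1, 12] - [1, 24])))))
"""

if __name__ == "__main__":
    for depth, node in walk(parse_tree(SAMPLE)):
        if node.kind == "quoted_string":
            row, col = node.start
            # Positions in the dump are zero-based; editors usually show 1-based.
            print(f"string literal at line {row + 1}, column {col + 1}")
```

From here, something like "list every string literal passed to asciidelete" is a short loop over walk().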

Running our source file through the command-line syntax highlighting utility produces the following (pasted as an image):

[Image: syntax.jpeg]

Just to reiterate the flexibility of having a syntax tree: I produced a highlighted version of my code in my terminal, independent of any editor.

Tree-sitter can also report errors in syntax. Let's assume we forgot a semicolon on the last statement of test.pro. The syntax tree would display the following:

Code:

...
(builtin_funcs [18, 0] - [18, 63]
    (attr_funcs [18, 0] - [18, 63]
      (attrputn_and_attrputs [18, 0] - [18, 63]
        (quoted_string [18, 9] - [18, 15])
        (quoted_string [18, 17] - [18, 28])
        (quoted_string [18, 30] - [18, 38])
        (quoted_string [18, 40] - [18, 50])
        (quoted_string [18, 52] - [18, 62])))))
test.pro	   0.12 ms	  6474 bytes/ms	(MISSING ";" [18, 63] - [18, 63]) 
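Notice how it points to the exact line and column of the error (positions are zero-based, so [18, 63] is line 19, column 64) and even attempts to recommend a solution? Because the parser can recover by assuming a single missing token, it reports a MISSING ";" node rather than a generic ERROR node. Pretty cool!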
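A tool could turn that output into editor-style diagnostics with very little code. The following is a minimal sketch that scans the text output for MISSING and ERROR nodes. The MISSING pattern follows the line shown above; the ERROR pattern (the same form without a token) and the function name are assumptions for illustration.

```python
import re

# Matches e.g. (MISSING ";" [18, 63] - [18, 63]) or (ERROR [3, 0] - [3, 12]).
PROBLEM_RE = re.compile(
    r'\((MISSING|ERROR)(?: ("[^"]*"|\w+))? \[(\d+), (\d+)\] - \[(\d+), (\d+)\]'
)


def find_problems(cli_output):
    """Return human-readable messages for MISSING and ERROR nodes."""
    messages = []
    for kind, token, row, col, _end_row, _end_col in PROBLEM_RE.findall(cli_output):
        # Convert zero-based positions to the 1-based form editors display.
        where = f"line {int(row) + 1}, column {int(col) + 1}"
        if kind == "MISSING":
            messages.append(f"{where}: missing {token}")
        else:
            messages.append(f"{where}: syntax error")
    return messages


if __name__ == "__main__":
    output = 'test.pro\t0.12 ms\t6474 bytes/ms\t(MISSING ";" [18, 63] - [18, 63])'
    for message in find_problems(output):
        print(message)
```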

That pretty much sums it up! In theory this could enable a lot more than just syntax highlighting, both on the web and in the various applications that support tree-sitter. Each editor will have its own way of loading the language grammar (I personally use Emacs for, well, everything), so you'll need to consult that editor's documentation on a case-by-case basis.

If you want to contribute, please send patches using the git-send-email workflow to dev@mailcd.com. Last note, this is licensed under the 3-clause BSD license, so feel free to use it, fork it, extend it, or sell it.

---

Additional Details:

At a high level, the tree-sitter process works like this:
  • A parser generator is provided a language grammar
  • A language parser is generated from the language grammar
  • The parser is used to build a concrete syntax tree (CST) of a source file
  • The CST is paired with highlighting rules that map syntax nodes to styles (colors, etc.)
The syntax tree is updated incrementally (only the edited portion is re-parsed), so updates are extremely efficient, and performance can be far better than traditional regex-based matching.
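The last step, pairing nodes with highlighting rules, can be sketched in a few lines of Python. Tree-sitter itself expresses these rules as query files rather than a dictionary, so treat the HIGHLIGHTS table, the color choices, and the function name below as assumptions made for illustration:

```python
# Hypothetical highlight rules: node type -> ANSI color code.
HIGHLIGHTS = {
    "comment": "\033[90m",        # grey
    "quoted_string": "\033[32m",  # green
    "float": "\033[36m",          # cyan
}
RESET = "\033[0m"


def highlight_line(line, spans):
    """Colorize one line given (node_type, start_col, end_col) spans.

    Spans are assumed to be non-overlapping and to lie within the line.
    """
    out, cursor = [], 0
    for node_type, start, end in sorted(spans, key=lambda s: s[1]):
        color = HIGHLIGHTS.get(node_type)
        if color is None:
            continue  # no rule for this node type; leave the text as-is
        out.append(line[cursor:start])
        out.append(color + line[start:end] + RESET)
        cursor = end
    out.append(line[cursor:])
    return "".join(out)


if __name__ == "__main__":
    # Columns taken from the (float [4, 15] - [4, 24]) node in the tree above.
    line = "numbertostring(12345.123);"
    print(highlight_line(line, [("float", 15, 24)]))
```

Because the rules only refer to node types, the same table works for any file the parser can handle, which is what makes the highlighting independent of any particular editor.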

There are multiple parser generator projects, but tree-sitter seems to be gaining wide adoption and is integrating into many larger projects. Tree-sitter can also be used in both applications and the browser, making it extremely versatile. Tree-sitter by itself just produces the CST (and may provide highlighting groups), but how the CST is used is completely open-ended. Although commonly used for syntax highlighting, it can also be used for things such as code folding, syntax-based error reporting, and other useful features.

Here's a list of a few places where tree-sitter is currently being used (either directly or through a plugin):
  • Emacs (as of version 29)
  • Atom
  • VSCode
  • GitHub (used in the browser)
  • Probably more
When looking for an overview of tree-sitter to link to this post, I came across this random person discovering the usefulness of tree-sitter in neovim. A bit of an odd video, but it suits my needs: https://www.youtube.com/watch?v=CPB6e5AGk6s

If you're interested in finding out more, I'd recommend starting with the tree-sitter documentation: https://tree-sitter.github.io/tree-sitter/
Last edited by bgregs on Fri May 10, 2024 11:27 am, edited 1 time in total.
Post by Wim Gielis »

Wow! Even understanding only half of the above, I already find it fascinating :D
I would be very much interested in seeing how this evolves and how it opens up new possibilities.
Best regards,

Wim Gielis
