I sure wish that were true for everybody creating new languages! (Note that I did not refer explicitly to C and all the languages derived from it.)
jschell wrote:In Computer Science the area of Compiler Theory is very old and very well studied.
The problem is in creating the language in the first place such that it is deterministic, and second in creating a compiler that can report errors. That last part is the most substantial part of every modern compiler (even toy ones.)

Reminds me of VAX/VMS: every message delivered by system software (including compilers) was headed by a unique, language-independent numeric code. Support people always asked you to supply the code; the message text could be in any language - they never read it anyway.
So for every language added it is reasonable to expect that the number of keywords would be duplicated.

You are missing my point completely. None of if, then, si, alors, om, or så would be reserved words in the language. The language would define non-text tokens, call them [if] and [then] if you like, but the representation is binary, independent of any text.
Keywords often cannot be used in code both because it makes it much harder for the compiler to figure it out and for it to correctly report on errors.

No one is suggesting that you be allowed to use the binary [if] token as a user-defined symbol.
The display representation of the binary [if] token could be e.g. (boldface) if, or [if], si, [si], om, [o], or some other way to visually highlight that this is not a user identifier but a control statement token. For creation of new control structures, an IDE working directly on a parse tree representation could provide function keys for inserting complete control skeletons. I have worked with several systems built that way, both for data structures, graphic structures - and for program code, although the latter inserted textual keywords, not binary tokens the way I wish it would. Once you get out of the habit of thinking of your program as a flat string of 7-bit ASCII characters, it is actually quite convenient! (You can assign the common structures, like if/else, loops, methods etc. to the F1-F13 keys so that you don't have to move your hand over to the mouse to select from a menu.)
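To make the idea concrete, here is a minimal sketch in Python of how an IDE might render one and the same stored token stream under different UI languages. The token IDs, the display tables, and the render function are all invented for illustration; they are not taken from any real compiler or IDE.

```python
# Hypothetical binary token IDs; the stored program never contains keyword text.
TOK_IF, TOK_THEN, TOK_IDENT = 0x01, 0x02, 0x10

# Per-language display tables for control tokens.
# User identifiers are not in these tables; they pass through unchanged.
DISPLAY = {
    "en": {TOK_IF: "if", TOK_THEN: "then"},
    "fr": {TOK_IF: "si", TOK_THEN: "alors"},
    "sv": {TOK_IF: "om", TOK_THEN: "så"},
}

def render(tokens, lang):
    """Turn a stored (kind, value) token stream into display text for one UI language."""
    out = []
    for kind, value in tokens:
        if kind == TOK_IDENT:
            out.append(value)                 # user symbol: same in every language
        else:
            out.append(DISPLAY[lang][kind])   # control token: shown in the UI language
    return " ".join(out)

# The same stored program, rendered for three locales:
program = [(TOK_IF, None), (TOK_IDENT, "ready"), (TOK_THEN, None), (TOK_IDENT, "start")]
print(render(program, "en"))   # if ready then start
print(render(program, "fr"))   # si ready alors start
print(render(program, "sv"))   # om ready så start
```

Note that nothing in the stored program changes between renderings; only the display layer differs, which is the whole point.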
So not only would the number of keywords increase but the programmer would still need to be aware of all of those keywords while coding.

Quite the contrary! The programmer might very well define a variable named if, which is distinct from the binary token [if]. There would be no reserved words at the textual level.
A not very well known fact: classic FORTRAN actually managed without reserved words. I just posted an entry in 'The Weird and the Wonderful' - something from my student days that I found in a box in the basement - to illustrate the point. Note, however, that the F77 philosophy is not what I am asking for: it did not represent control (and other) structures by binary tokens, but relied on semantic analysis of plain-text source code.
1. Two programmers are working on the same file. The file MUST be syntactically correct before developer A (English) goes on vacation. Because otherwise the mechanism (code) that must translate it back from the English form will not work when developer B is French.

I say again: you missed my point completely. If the IDE stores the code as a parse tree, it is syntactically correct; otherwise the IDE would not have accepted it. Of course developer B may define user variables and methods with French names, but he can do that in any IDE environment.
2. Comments cannot be supported.

Why can't the parser define a binary 'comment' token and store that in the parse tree? In one project I am currently working on (which is not a general programming language, but an application-specific control language), we are doing exactly that. The comment token may have a value field with several alternate texts, each identified by a language code, so that if you select, say, French as your UI language and there is a French version of the comment, that is the one displayed. (Otherwise, when the English comment, say, is displayed, you can add a French translation of it.)
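A sketch of such a comment token, again in Python with invented names (this is the general idea, not the actual code from the project mentioned above): the comment is one node whose value holds alternate texts keyed by language code, and display picks the preferred language, falling back when no translation exists.

```python
# Hypothetical comment node: a single binary token whose value field holds
# several alternate texts, each keyed by a language code.
def comment_text(comment, preferred, fallbacks=("en",)):
    """Pick the comment text in the preferred UI language, else fall back."""
    for lang in (preferred, *fallbacks):
        if lang in comment:
            return comment[lang]
    return next(iter(comment.values()))  # last resort: any stored text

c = {"en": "loop until the queue drains",
     "fr": "boucler jusqu'à vider la file"}

print(comment_text(c, "fr"))  # a French text exists, so it is shown
print(comment_text(c, "sv"))  # no Swedish text: falls back to the English one
```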
3. Third party APIs would still require whatever is supported by the 3rd party service (library, Rest, TCP, whatever.)

I have worked with third-party APIs with French method and parameter names in an otherwise English-language environment; it was a nightmare ... If you define a language along the lines I am suggesting, a library would be delivered as a parse tree as well, along with one or more symbol tables (i.e. for different languages) for use in the API. (This is how we do it in the application control language mentioned above.) Otherwise, if the binary interface is given and the library comes in a compiled, linkable format with fixed entry point symbols, your parse-tree interface to that library should include a mapping from each call token to the corresponding entry point symbol, decoupling that symbol from what is displayed. Establishing this mapping is a one-time operation, and the mapping could accompany the library file, similar to how a '.h' file follows a C library.
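The call-token mapping just described can be sketched as two small tables: one stable mapping from call tokens to linker entry points, and one per-language table of display names. Everything here (token IDs, symbol names, function names) is hypothetical, invented purely to illustrate the separation.

```python
# Hypothetical library interface: stable call-token IDs map to link-time entry
# points, while a separate table per language supplies the displayed names.
ENTRY_POINTS = {0x2001: "_lib_sort", 0x2002: "_lib_find"}   # token -> entry symbol
SYMBOLS = {
    "en": {0x2001: "sort",  0x2002: "find"},
    "fr": {0x2001: "trier", 0x2002: "chercher"},
}

def resolve_call(token_id):
    """Map a call token in the parse tree to the linker entry point symbol."""
    return ENTRY_POINTS[token_id]

def display_call(token_id, lang):
    """Show the same call under the reader's chosen language."""
    return SYMBOLS[lang][token_id]

# The French programmer sees 'trier', the English one 'sort';
# both calls link against the same entry point.
print(display_call(0x2001, "fr"), "->", resolve_call(0x2001))
print(display_call(0x2001, "en"), "->", resolve_call(0x2001))
```

The design point is that the linked symbol never appears in the displayed source at all; it lives only in the one-time mapping.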
4. Adding new languages to the compiler after first release would mean that existing applications could break because existing code might use them.

I assume you refer to new language features introducing new keywords. If there are no keywords, the problem you are pointing to vanishes: adding a new binary token, with its own unique token ID, would not invalidate any existing program whatsoever. Of course there is the question of where the display mapping is done. If the IDE does it, and imports a new compiler with new binary tokens, it might not have a proper French or Swedish word to represent them. If the new, extended compiler is delivered with a token display mapping table for a number of languages, the problem is significantly reduced. (The user may have a language fallback list, both for comments and for other binary tokens, so that something meaningful is displayed, even if not in the primary language.)
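The fallback-list idea can be sketched in a few lines. Assume (hypothetically) that an updated compiler introduces a new token whose French display name is not yet available; lookup simply walks the user's fallback list, and as a last resort shows the raw token ID, so no program ever becomes undisplayable.

```python
# Hypothetical display tables after a compiler update: the new token 0x03 has
# an English name, but the French table has not yet been updated.
TOKEN_NAMES = {
    "en": {0x01: "if", 0x03: "parallel-for"},
    "fr": {0x01: "si"},
}

def token_text(token_id, fallback_list=("fr", "en")):
    """First language in the user's fallback list that can name the token wins."""
    for lang in fallback_list:
        if token_id in TOKEN_NAMES.get(lang, {}):
            return TOKEN_NAMES[lang][token_id]
    return f"[token {token_id:#x}]"   # still displayable; the program never breaks

print(token_text(0x01))  # si            (the primary language, French, has a name)
print(token_text(0x03))  # parallel-for  (falls back to English)
print(token_text(0x99))  # [token 0x99]  (unknown everywhere, shown by ID)
```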
As I wrote in my post, almost all of your comments are fundamentally based on the idea that a source program really, as a matter of fact, is a string of 7-bit ASCII characters, and that this will always remain true. I am suggesting that it is not.
trønderen wrote:This is certainly extremely difficult, probably across the borderline to the impossible, if we insist on thinking along exactly the same tracks as we have always done before, refusing to change our ways even a tiny little bit.
Compare an old-style text formatter such as troff with, say, MS Word. You may argue that '\fI' is like a reserved word for italicizing text; you cannot use it as plain text (without quoting), because troff stores everything as plain text. MS Word does not: prior to .docx, the storage format was a true binary format, and even the XML is just a storage encoding; internally, the working format is binary, just as before. In MS Word, '\fI' is freely available as document text without quoting. Furthermore, you can move a document from an English MS Word to a French one and then to a Swedish one: the menu texts, help texts etc. change language, yet an edit made in one language version is equally valid in the other language versions of MS Word.
I certainly can imagine a programming language, and its parse tree storage format, being designed along the same principles.