---- Input syntax description

The input syntax is loosely patterned after the VMS LIB$TPARSE
facility. The bulk of the program consists of state and transition
declarations. There are some other lines which are frequently used;
they are described after the state and transition declarations.

A state is declared with a `$state' line. The syntax of this is

        $state [name]

The optional name is to allow transitions to refer to this state from
far away.

A transition is declared with a `$tran' line. The syntax of this is

        $tran trigger [-> statename] [action]

The trigger specifies when this transition will be taken. If the arrow
and statename are present, they specify what state the transition is
to; otherwise, the transition is to the next state listed (the one
corresponding to the next $state line in the input). The statename can
also be $exit or $fail, causing the parse to succeed or fail
respectively.

An optional action routine may be specified. This looks like a normal
C routine call (without the trailing semicolon that would make it a
statement), with one exception: if there are no explicit arguments, the
parentheses must be omitted. One extra argument will be added, after
any arguments explicitly listed; what this argument is depends on what
sort of trigger is given.

The taking of a transition consumes zero or more characters of the
input string; how much of the string is consumed depends on what sort
of trigger is given.

The possible transitions out of a state are tried in order; the first
one to succeed is the one that is taken. This means that it is legal
for more than one transition to match; the one taken is the one that is
listed first. If none of the transitions out of a state match, the
parse fails as if a transition to $fail had been made.
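
The "first transition to succeed wins" rule can be illustrated with a
small hand-written sketch. This is not the generated code, only an
illustration under assumptions: the struct tran type, the ST_EXIT and
ST_FAIL constants, and the first_match and step functions are all
hypothetical names, and only string triggers are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one transition with a string trigger. */
struct tran {
    const char *trigger;  /* string that must appear at the input point */
    int next;             /* target state number, or ST_EXIT / ST_FAIL */
};

/* Hypothetical special targets, standing in for $exit and $fail. */
enum { ST_EXIT = -1, ST_FAIL = -2 };

/* Try the transitions in the order listed; return the index of the
   first one whose trigger matches at `input', or -1 if none does. */
int first_match(const struct tran *t, size_t n, const char *input)
{
    assert(t != NULL && input != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(t[i].trigger);
        if (strncmp(input, t[i].trigger, len) == 0)
            return (int)i;
    }
    return -1;
}

/* Take one step out of a state: consume the matched trigger and return
   the target state.  If nothing matches, behave like an implicit
   transition to $fail and leave the input pointer alone. */
int step(const struct tran *t, size_t n, const char **input)
{
    int i = first_match(t, n, *input);
    if (i < 0)
        return ST_FAIL;
    *input += strlen(t[i].trigger);
    return t[i].next;
}
```

With the transitions "ab" and "abc" listed in that order, the input
"abcd" takes the "ab" transition even though "abc" would also match,
which is why longer alternatives usually want to be listed first.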

Normally, all leading whitespace (as determined by isspace()) is
silently skipped upon entry to each state (i.e., immediately before an
attempt is made to match the first transition, but after any action
routine applying to the transition that entered the state). There is a
flag that can be set to inhibit this behavior; see the C interface
section for details.

Normally, action routines are assumed to return nothing interesting.
It is possible to make a transition's success conditional on something
returned by the action routine. To do this, prefix the action routine
with a question mark. When this is done, the action routine is assumed
to return an integer value. If this value is zero, the transition
fails; otherwise it succeeds as normal.

The trigger can be

- a string "...". The transition will be taken if that string is
  present at the current point in the input string; the entire string
  will be consumed by the transition. The extra argument passed to the
  action routine is a pointer to a string literal containing the
  trigger string (not a pointer into the string being parsed).

- a character 'c'. The transition will be taken if that character is
  the next character in the input string. The transition consumes the
  character from the input. The extra argument to the action routine
  is the character.

- !statename (where statename must be the name of a state). This
  corresponds to "calling a subroutine" in the state table. The
  current point in the input is remembered and the state table is
  entered at the specified state. The machine is run until a
  transition to $exit or $fail (including implicit $fail transitions
  due to failure to match any transitions) is taken. If the subparse
  succeeds (the transition was to $exit), then the ! transition
  succeeds and the input pointer is advanced to where it was when the
  $exit transition was taken. If the subparse failed, the input
  pointer is backed up to the remembered value and the ! transition
  fails.
  Side effects due to action routines called during the (partial)
  subparse cannot, of course, be undone. The extra argument to the
  action routine is the integer constant zero.

- a keyword specifying a class of characters or a particular sort of
  character string:

        $any      Matches any character (except NUL). Consumes exactly
                  one character on match. Action routine argument:
                  character matched.

        $decimal  Any nonempty string of digits 0-9. Consumes entire
                  string. Action routine argument: a `long int'
                  containing the value of the number.

        $digit    Any single digit 0-9. Consumes the digit. Action
                  routine argument: the digit matched. (The character,
                  not the corresponding integer 0-9.)

        $eos      Matches at end-of-string only. Consumes nothing.
                  Action routine argument: integer constant zero.

        $hex      Like $decimal except number is in hex. (No leading
                  0x is included.)

        $octal    Like $decimal and $hex except number is in octal.

        $binary   Ditto for binary.

        $symbol   Matches a string of characters consisting of digits,
                  letters of either case, dollar signs, and
                  underscores, provided it does not begin with a digit.
                  Currently, this string must be no longer than 256
                  characters. Consumes the entire string. Action
                  routine argument: a copy of the string matched (not a
                  pointer into the string being parsed). This must be
                  copied if it is to be saved; the string passed to the
                  action routine will be destroyed at some more or less
                  unpredictable time after the action routine returns.

- the keyword $lambda, specifying a lambda transition. This type of
  transition always succeeds and consumes no characters. Action
  routine argument: integer constant zero.

- the keyword $anyof, followed by a name that has previously appeared
  on a line beginning with $anyof (see below). This matches exactly
  one character from the specified set. It consumes just that
  character; the action routine argument is the character matched.

What other sorts of line are there? They are distinguished by the
keyword at the beginning of the line.
$state and $tran have already been described. There are also $prefix,
$trace, $initial, $action, and $anyof. Blank lines are ignored, and
lines beginning with $$ have the $$ stripped and are prepended (in
order of appearance, otherwise unchanged) to the output.

The $prefix line looks like `$prefix symbol' where symbol is any legal
C symbol. This specifies a prefix which is put at the beginning of all
variables and functions generated for the machine. This allows you to
use more than one fsm in a given program without name clashes. If you
don't specify a $prefix line, the default is

        $prefix FSM

The $trace, $action, and $initial lines each take a C function name,
as in

        $trace trace_fxn
        $action action_fxn
        $initial initial_fxn

The function specified with $trace gets called each time a state is
entered. It is called with two arguments. The first argument is a
small integer giving the number of the state, which is usually close
to its number of appearance in the input (but this is not guaranteed);
the second one is the name given on the $state line, if any, otherwise
a concocted name of the form "State %d".

The function specified with $action gets called just before any action
routine is called. It is called with one argument, that being a
character string showing the call about to be made, as it appears in
the generated C source (not as it appears in the input file).

The function specified with $initial gets called once each time the
parser is called; it is called just before the first state is entered.
In particular, it is called after the parser internals are set up but
before the input string is looked at. This could be used, for example,
to turn on the FSM_FLAG_PARSE_BLANKS flag for a parser that is not
supposed to consider whitespace special. (Beginning the parser with a
$lambda transition with an action would not be enough if the whitespace
occurs at the beginning of the string.)
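
As a rough illustration of what a function named on a $trace line
might look like, here is a minimal sketch. The two arguments follow
the description above, but their exact C types (int and const char *)
are assumptions, as are the format_trace helper and the trace_calls
counter, which exist only for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical counter: how many state entries have been traced. */
int trace_calls = 0;

/* Hypothetical helper: format one trace line into `buf'.  Returns the
   value of snprintf(), i.e. the length the full line would have. */
int format_trace(char *buf, size_t size, int state, const char *name)
{
    assert(buf != NULL && name != NULL);
    return snprintf(buf, size, "state %d (%s)", state, name);
}

/* Sketch of a routine that could be named on a `$trace trace_fxn'
   line.  The argument types are assumed: a small state number and
   the state's name (or a concocted "State %d" name). */
void trace_fxn(int state, const char *name)
{
    char line[128];

    format_trace(line, sizeof line, state, name);
    trace_calls++;
    fprintf(stderr, "%s\n", line);
}
```

Sending the trace to stderr keeps it apart from whatever the action
routines print; since the state number is only roughly its order of
appearance, the name is usually the more useful thing to log.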

If there is no $trace line, no trace routine is called for each state
entry; if there is no $action line, the corresponding function calls
are omitted before the action routines are called.

The $anyof line declares a character class for use with the $anyof
transition trigger. The syntax is

        $anyof name ...characters...

The name is any legal C symbol; the characters may be any characters.
Nonprintable characters are represented as in C, with backslash
escapes. Spaces and tabs are ignored following the name up until the
first non-whitespace character. (Spaces may be specified as \040 and
tabs as \t, if this causes any problem.) The order of the characters
in the list is irrelevant and duplicates are ignored.

---- C interface to the resulting code

The generated C code contains several functions, some of which are
internal to the parser. The ones whose interface is advertised are as
follows. (PFX represents the name specified on a $prefix line. The
FSMarg type is defined in fsm.h.)

        PFX(s)
        char *s;

  This is the main entry point for the parser. It takes a pointer to
  the string to parse and returns 0 if the parse failed and 1 if it
  succeeded. The state table is entered at the first state given in
  the input file.

        FSMarg *PFXgetarg()

  This returns a pointer to a structure used during the parse. This is
  provided so that action routines can do things like set and clear
  flags here. See the fsm.h file for a description of this structure.
  Currently, the only interesting thing to do is change the
  FSM_FLAG_PARSE_BLANKS bit in the flags field. This bit is clear by
  default; when set, the parser's normal action of stripping all
  whitespace on entry to each state is inhibited.

        char *PFXrest()

  Once the main parsing function has returned, this can be called to
  determine the unparsed portion of the string. It returns a pointer
  to the unparsed portion. If the parse succeeded, this is where the
  parser had reached when the transition to $exit was made.
  If the parse failed, this is an attempt at pointing to the character
  that caused it to fail. This is not guaranteed, because recursion
  (! triggers) can confuse things.

All other names beginning PFX or _PFX should be considered reserved.

Please mail compliments, flames, bug reports/fixes, comments, etc to

                                        der Mouse

                        mouse@rodents.montreal.qc.ca
                7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B