.Dd September 14, 1996
.Dt FSM 1
.Os NetBSD 1.2BETA
.Sh NAME
.Nm fsm
.Nd finite-state-machine parsing language compiler
.Sh SYNOPSIS
.Nm
\*(Lt
.Ar inputfile
\*(Gt
.Ar outputfile
.Sh DESCRIPTION
.Nm
is a compiler from a finite-state-machine language to C.
The state machine is read from the standard input; the resulting C code
is sent to the standard output.
.Sh INPUT LANGUAGE
The input language is loosely patterned after the old VMS LIB$TPARSE
facility's language, but
.Nm
has many capabilities not present in LIB$TPARSE.
The input has C-style comments stripped and is then taken as a series
of lines.
Blank lines are ignored; any remaining lines must be of one of the
following forms:
.Bl -tag -width indent
.It \&$$ Ar text Ns ...
Everything after the $$ is simply copied verbatim to the output file.
This can be used to declare routines, include files, or anything else.
No syntax checking is done; it is entirely possible to generate syntax
errors in the resulting C file with $$ lines.
All such output appears before any of the
.Nm fsm Ns -generated
output.
.It \&$prefix Ar string
A $prefix line specifies a prefix string that is applied to most
symbols in the generated C code.
All symbols beginning with
.Ar string
should be considered as private to the generated code; attempting to
use any of them (except as documented) is liable to break.
Also,
.Ar string
alone is the name of the principal entry point to the generated parser.
.It \&$trace Ar string
.Ar string
is the name of a routine to call every time a new state is entered.
The routine is passed two arguments, an
.Sq int
and a
.Sq const char * ,
and its return value, if any, is ignored.
No declaration for the trace routine is provided by
.\" Grrr, why doesn't ".Nm ." work?
.Nm fsm .
.It \&$initial Ar string
.Ar string
is the name of a routine which is called when the parser is entered,
after everything is set up but before the first state is entered.
The routine is passed no arguments and has access to the usual
callbacks as if it were an action routine.
.It \&$action Ar string
.Ar string
is the name of a routine to call every time an action routine (see the
description of $tran lines) is called.
It is passed a single
.Sq const char * ,
and its return value, if any, is ignored.
No declaration for the action trace routine is provided by
.\" Grrr, why doesn't ".Nm ." work?
.Nm fsm .
.It \&$anyof Ar name Ar characters Ns ...
.Ar name
is declared as a name for the specified set of
.Ar characters ,
for use with $anyof triggers on $tran lines.
Whitespace is ignored between the
.Ar name
and the
.Ar characters ,
but not within the
.Ar characters ;
to specify whitespace in the
.Ar characters ,
either put a non-whitespace character first or backslash the first
whitespace character.
Backslashes within the
.Ar characters
are special; they can specify C-style octal escapes or C-style
single-character escapes (\ea, \eb, \ee, \ef, \en, \er, \et are
supported, though \e followed by any other letter should be considered
reserved for future such escapes); \e followed by any other character
simply quotes that character.
.It \&$state Op Ar name
A $state line declares a new state in the state machine.
If the optional
.Ar name
is given, the state can be referred to by that name from lexically
distant parts of the state table; otherwise, the only way to refer to
it is from $tran lines in the previous state (or by virtue of its being
the first state \- see the GENERATED PARSER section below).
.It \&$tran Xo
.Ar trigger
.Op \-\*(Gt Ar state
.Op Oo \&? Oc Ar action
.Xc
A $tran line specifies a transition from one state to another.
Note that, unlike a pure theoretical finite state machine,
.Nm fsm Ns -generated
code examines the transitions out of a state in the order they were
declared and takes the first one that matches, regardless of whether
any others might match.
The
.Ar trigger
specifies under what conditions the transition will match; since the
list of them is long, it is given after the rest of the $tran line is
described.
.Ar action ,
if specified, is the name of an action routine to call.
It can either be a simple name by itself, or it can be a full C call,
including the parentheses; the parser makes an argument available,
which can be passed to the action routine by using the pseudo-argument
.Sq \&$arg
as one of the arguments in the call.
If the
.Sq \&?
is given, the action routine is assumed to return a value that can be
treated as a conditional, and the return value must be
.Sq true
(i.e., non-zero) for the transition to succeed (it is called only if
the transition would otherwise succeed); if the
.Sq \&?
is not given, the action routine's return value, if any, is ignored.
If the \-\*(Gt and
.Ar state
are given, then when the transition succeeds, the state entered is the
one named
.Ar state ;
otherwise, the state entered is the one specified by the next $state
line in the input.
The
.Ar state
can also be one of the special strings
.Sq $exit
or
.Sq $fail ,
which cause the parse to succeed or fail, respectively, if the
transition would otherwise be taken.
If none of the $tran lines for a state match, the parse fails, as if
there were a lambda transition to $fail after all the specified $tran
lines.
.Ar trigger
specifications can be:
.Bl -tag -width indent
.It \&$any
The transition matches any single character.
The only way this trigger can fail to match is at the end of the input
string.
The action routine argument is an
.Sq int
holding the code of the character matched.
.It \&$anyof Ar name
The transition matches any character in the set declared with an $anyof
declaration with the matching
.Ar name .
It is an error for there to be no such declaration.
The action routine argument is an
.Sq int
holding the code of the character matched.
.It \&$binary
.It \&$decimal
.It \&$octal
.It \&$hex
The transition matches a number in the appropriate base.
(Both upper and lower case letters A through F are acceptable to $hex.)
The action routine argument is the number's value, as an
.Sq unsigned long int .
.It \&' Ns Ar char Ns \&'
The transition matches the single character given.
The
.Ar char
may use backslash escapes, which function as described above for $anyof
declarations.
The action routine argument is an
.Sq int
holding the character's code.
.It \&$digit
The transition matches any single digit (as defined by
.Xr isdigit 3 ) .
The action routine argument is an
.Sq int
holding the character's code (not the value of the single-digit number
it forms; for example, if $digit matches a
.Sq 3 ,
the value passed will be '3' (e.g., 51 if ASCII is in use), not 3).
.It \&$eos
The transition matches only at end-of-string.
No characters are consumed.
The action routine argument is the integer constant 0.
.It \&$lambda
The transition always matches; no characters are consumed.
The action routine argument is the integer constant 0.
.It Xo
.Qq Ar string
.Op \&* Qq Ar suffix
.Xc
The transition matches the
.Ar string ;
if the
.Sq \&*
and
.Ar suffix
are present, the match succeeds only when the
.Ar string
is immediately followed by the
.Ar suffix
in the input string.
The characters matching
.Ar string
are consumed
.Pf ( Ar suffix ,
if specified, is not consumed).
The action routine argument is a
.Sq char *
pointing to a (NUL-terminated) temporary copy of the matched string.
The storage pointed to is not valid after the action routine returns;
it must be copied if it is to be saved.
.It \&$symbol
The transition matches a
.Sq symbol ,
which is defined to be a string of alphanumerics, dollar signs, and
underscores, except that the first character must not be a digit.
.Pf ( Xr isalpha 3
and
.Xr isalnum 3
are used to test characters.)
The action routine argument is a
.Sq char *
pointing to a (NUL-terminated) temporary copy of the matched string.
The storage pointed to is not valid after the action routine returns;
it must be copied if it is to be saved.
.It \&! Ns Ar name
When a transition of this form is being considered, the current
position in the input string is remembered, then the state machine is
re-entered at the state named
.Ar name ,
with the input string pointer unmoved.
The parse proceeds until it either succeeds or fails; if it succeeds,
the transition matches, consuming the characters consumed by the
sub-parse; if it fails, the transition fails and the input string is
backed up to where it was when the sub-parse began.
Since action routine calls cannot, in general, be undone, care is
required when the sub-parse involves action routines.
.El
.El
.Sh GENERATED PARSER
The interface to the resulting C code takes the form of one include
file and three routines.
The include file is
.Aq Pa fsm.h ;
the three routines are
.Ar prefix ,
.Ar prefix Ns getarg ,
and
.Ar prefix Ns rest ,
where
.Ar prefix
is the string specified on a $prefix line, or FSM if no $prefix line is
given.
Argument patterns and return values are
.Pp
.Dl int Ar prefix Ns (char *)
.Dl FSMarg * Ns Ar prefix Ns getarg(void)
.Dl char * Ns Ar prefix Ns rest(void)
.Pp
The first of these is the main interface to the parser; it enters the
state machine at the first $state line and proceeds until the parse
succeeds or fails; it returns nonzero if the parse succeeds and zero if
it fails.
.Pp
The second allows access to a flags field; the
.Sq FSMarg \&*
returned points to a structure with an
.Sq int flags
member; the only documented bit in this field is
.Dv FSM_FLAG_PARSE_BLANKS ,
which if set indicates that all characters are significant; if it is
clear (the default), then immediately upon entry to each state, all
leading whitespace will be silently consumed from the input string.
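.Pp
As a rough sketch of what this flag changes, the C fragment below
imitates the documented whitespace rule with hand-written stand-ins; it
is not
.Nm fsm
output.
The names DEMO, DEMOgetarg, enter_state, and parse_with_flags, the
FSMarg layout beyond its flags member, and the bit value chosen for
FSM_FLAG_PARSE_BLANKS are all assumptions made for illustration; real
code would take them from the include file and the generated parser.

```c
#include <ctype.h>

/* Stand-ins for what <fsm.h> would provide; the bit value is assumed. */
#define FSM_FLAG_PARSE_BLANKS 0x1
typedef struct { int flags; } FSMarg;

static FSMarg demo_arg;        /* flags clear by default */
static const char *demo_pos;   /* current position in the input */

/* Hypothetical counterpart of prefixgetarg for a "$prefix DEMO" parser. */
static FSMarg *DEMOgetarg(void)
{
    return &demo_arg;
}

/* Applied on entry to each state: skip blanks unless the flag is set. */
static void enter_state(void)
{
    if (demo_arg.flags & FSM_FLAG_PARSE_BLANKS)
        return;
    while (isspace((unsigned char)*demo_pos))
        demo_pos++;
}

/* Toy two-state machine: a $digit transition, then an $eos transition. */
static int DEMO(char *s)
{
    demo_pos = s;
    enter_state();
    if (!isdigit((unsigned char)*demo_pos))
        return 0;
    demo_pos++;
    enter_state();
    return *demo_pos == '\0';
}

/* Convenience wrapper: set the flags word, then run the parse. */
static int parse_with_flags(char *s, int flags)
{
    DEMOgetarg()->flags = flags;
    return DEMO(s);
}
```

Under these assumptions, parse_with_flags applied to the string "  7  "
returns nonzero when flags is 0 and zero when FSM_FLAG_PARSE_BLANKS is
set, because the leading blank is then significant and cannot match the
digit transition.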
.Pp
The third returns a pointer into the string passed to the first; the
pointer is the current position of the state machine in the input
string.
This is most useful after a parse has succeeded or failed; when that
occurs, this points to the rest of the input string.
For success, this can be used to parse more of the input; for failure,
it can be used to provide some indication in an error message of where
the error was found.
.Sh BUGS
The generated C code is definitely not thread-safe; indeed, it is not
even recursion-safe.
(An action routine, for example, cannot recursively call the same
parser again.)
The interface would need a drastic rework to cure even the second,
never mind the first.
However, different parsers (with different $prefix strings) are
completely independent.
.Pp
The parser that handles action routine arguments is very simplistic.
In particular, its detection of
.Sq \&$arg
pseudo-arguments can easily be fooled by such things as strings
containing
.Sq \&$arg
in the argument list.
If this happens, putting parentheses around the argument will protect
it.
Its counting of parentheses is also very stupid, and can easily be
fooled by parentheses in character constants or strings.
If this happens (which will probably produce a syntax error), adding
appropriate
.Sq matching
parentheses in comments will cure it, as in
.Dl \&$tran '('->newstate actionroutine('('/*)*/,3,"foo")
.Sh AUTHOR
der Mouse,
.Aq mouse@rodents.montreal.qc.ca .