package tokenize

import (
	"git.makaay.nl/mauricem/go-parsekit/read"
)

// API holds the internal state of a tokenizer run. A tokenizer run uses
// tokenize.Handler functions to move the tokenizer forward through the
// input and to provide tokenizer output.
//
// The methods provided by the API are used by tokenize.Handler functions to:
//
// • access and process runes / bytes from the input data
//
// • flush processed input data that are no longer required (FlushInput)
//
// • fork the API for easy lookahead support (Fork, Merge, Reset, Dispose)
//
// • emit tokens and/or bytes to be used by a parser
//
// BASIC OPERATION:
//
// To retrieve the next rune from the API, call the NextRune() method.
//
// When the rune is to be accepted as input, call the method Accept(). The rune
// is then added to the result runes of the API and the read cursor is moved
// forward.
//
// By invoking NextRune() + Accept() multiple times, the result can be extended
// with as many runes as needed. Runes collected this way can later be
// retrieved using the method Runes().
//
// It is mandatory to call Accept() after retrieving a rune, before calling
// NextRune() again. Failing to do so will result in a panic.
//
// Besides adding runes to the result, it is also possible to modify the
// stored runes or to add lexical Tokens to the result. For all things
// concerning results, take a look at the Result struct, which
// can be accessed through the method Result().
//
// FORKING OPERATION FOR EASY LOOKAHEAD SUPPORT:
//
// Sometimes, we must be able to perform a lookahead, which might either
// succeed or fail. In case of a failing lookahead, the state of the
// API must be brought back to the original state, so we can try
// a different route.
//
// The way in which this is supported is by forking an API struct by
// calling the method Fork(). This returns a forked child API, with
// empty result data, but using the same read cursor position as the
// forked parent.
//
// After forking, the same interface as described for BASIC OPERATION can be
// used to fill the results. When the lookahead was successful, then
// Merge() can be called on the forked child to append the child's results
// to the parent's results, and to move the read cursor position to that
// of the child.
//
// When the lookahead was unsuccessful, then the forked child API can be
// disposed by calling Dispose() on the forked child. This is not mandatory;
// garbage collection will take care of this automatically.
// The parent API was never modified, so it can safely be used after disposal
// as if the lookahead never happened. (A sketch of a snapshot-based variant
// of this pattern follows the API struct definition below.)
//
// Opinionated note:
// Many tokenizers/parsers take a different approach to lookaheads by using
// peeks and by moving the read cursor position back and forth, or by putting
// read input back on the input stream. That often leads to code that is
// efficient, but in my opinion not very intuitive to read. It can also be
// tedious to get the read cursor back to the correct position, which can
// lead to hard-to-track bugs. I much prefer this forking method, since no
// extra bookkeeping is required when implementing a parser.
type API struct {
	reader   read.Buffer // the buffered input reader
	pointers stackFrame  // various values for keeping track of input, output and cursor state

	Input  Input         // access to a set of general input-related methods
	Byte   InputByteMode // access to a set of byte-based input methods
	Rune   InputRuneMode // access to a set of UTF8 rune-based input methods
	Output Output        // access to a set of output-related methods
	Result Result        // access to a set of result retrieval methods

	outputTokens []Token // storage for accepted tokens
	outputBytes  []byte  // storage for accepted bytes
}
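// The function below is an editorial sketch, not part of the original file:
// it illustrates the lookahead pattern from the package documentation above,
// using the snapshot methods defined further down in this file instead of
// the Fork()/Merge()/Dispose() interface. The handler arguments are
// assumptions: they stand in for tokenize.Handler-style functions that
// return true on a successful match.
func matchFirstOf(tokenAPI *API, alternatives ...func(*API) bool) bool {
	for _, alternative := range alternatives {
		// Save the current input/output state, so a failed attempt
		// can be rolled back before trying the next alternative.
		snap := tokenAPI.MakeSnapshot()
		if alternative(tokenAPI) {
			return true
		}
		tokenAPI.RestoreSnapshot(snap)
	}
	return false
}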
type stackFrame struct {
	offset int // the read offset, relative to the start of the reader buffer
	column int // the column at which the cursor is (0-indexed, relative to the start of the stack frame)
	line   int // the line at which the cursor is (0-indexed, relative to the start of the stack frame)

	bytesStart int // the starting point in the API.bytes slice for produced bytes
	bytesEnd   int // the end point in the API.bytes slice for produced bytes
	tokenStart int // the starting point in the API.tokens slice for produced tokens
	tokenEnd   int // the end point in the API.tokens slice for produced tokens
}

// NewAPI initializes a new API struct, wrapped around the provided input.
// For an overview of allowed inputs, take a look at the documentation
// for parsekit.read.New().
func NewAPI(input interface{}) *API {
	tokenAPI := &API{
		reader: read.New(input),
	}
	tokenAPI.Input = Input{api: tokenAPI}
	tokenAPI.Input.Byte = InputByteMode{api: tokenAPI}
	tokenAPI.Input.Rune = InputRuneMode{api: tokenAPI}
	tokenAPI.Output = Output{api: tokenAPI}
	tokenAPI.Result = Result{api: tokenAPI}

	return tokenAPI
}

// Snapshot is a copy of the tokenizer's internal state pointers, which can
// be used to restore the tokenizer state at a later time.
type Snapshot stackFrame

// MakeSnapshot returns a copy of the current tokenizer state, for use
// with RestoreSnapshot().
func (tokenAPI *API) MakeSnapshot() Snapshot {
	return Snapshot(tokenAPI.pointers)
}

// RestoreSnapshot brings the tokenizer state back to a previously made
// snapshot, undoing all input and output handling that was done since
// the snapshot was taken.
func (tokenAPI *API) RestoreSnapshot(snap Snapshot) {
	tokenAPI.pointers = stackFrame(snap)
}

// Split holds the output pointers that were active before SplitOutput()
// was called, for use with MergeSplitOutput().
type Split [2]int

// SplitOutput splits off a new output range, so that bytes and tokens
// produced from this point on are accumulated separately from the output
// that was produced before the split. The returned Split can be used to
// rejoin the two ranges using MergeSplitOutput().
func (tokenAPI *API) SplitOutput() Split {
	split := Split{tokenAPI.pointers.bytesStart, tokenAPI.pointers.tokenStart}
	tokenAPI.pointers.bytesStart = tokenAPI.pointers.bytesEnd
	tokenAPI.pointers.tokenStart = tokenAPI.pointers.tokenEnd

	return split
}

// MergeSplitOutput rejoins the output ranges that were split using
// SplitOutput(), making the output one contiguous range again.
func (tokenAPI *API) MergeSplitOutput(split Split) {
	tokenAPI.pointers.bytesStart = split[0]
	tokenAPI.pointers.tokenStart = split[1]
}
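// The function below is likewise an editorial sketch, not part of the
// original file: it shows one plausible way to combine SplitOutput() and
// MergeSplitOutput(). The output is split so that the wrapped handler
// (a hypothetical tokenize.Handler-style function) produces its bytes and
// tokens in a fresh output range, after which the two ranges are rejoined
// into one contiguous output.
func runOnSplitOutput(tokenAPI *API, handler func(*API) bool) bool {
	split := tokenAPI.SplitOutput()
	ok := handler(tokenAPI)

	// Rejoin the pre-split and post-split output ranges.
	tokenAPI.MergeSplitOutput(split)

	return ok
}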