// go-parsekit/tokenize/api.go

package tokenize

import (
	"git.makaay.nl/mauricem/go-parsekit/read"
)

// API holds the internal state of a tokenizer run. A tokenizer run uses
// tokenize.Handler functions to move the tokenizer forward through the
// input and to provide tokenizer output.
//
// The methods as provided by the API are used by tokenize.Handler functions to:
//
// • access and process runes / bytes from the input data
//
// • flush processed input data that is no longer required (FlushInput)
//
// • fork the API for easy lookahead support (Fork, Merge, Reset, Dispose)
//
// • emit tokens and/or bytes to be used by a parser
//
// BASIC OPERATION:
//
// To retrieve the next rune from the API, call the NextRune() method.
//
// When the rune is to be accepted as input, call the method Accept(). The rune
// is then added to the result runes of the API and the read cursor is moved
// forward.
//
// By invoking NextRune() + Accept() multiple times, the result can be extended
// with as many runes as needed. Runes collected this way can later on be
// retrieved using the method Runes().
//
// It is mandatory to call Accept() after retrieving a rune, before calling
// NextRune() again. Failing to do so will result in a panic.
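//
// For example, accepting a run of digits might look like this (a sketch in
// terms of the methods described above, assuming NextRune() returns the rune
// that was read; error handling is omitted):
//
//	for {
//		r := tokenAPI.NextRune()
//		if r < '0' || r > '9' {
//			break
//		}
//		tokenAPI.Accept()
//	}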
//
// Next to adding runes to the result, it is also possible to modify the
// stored runes or to add lexical Tokens to the result. For all things
// concerning results, take a look at the Result struct, which
// can be accessed through the method Result().
//
// FORKING OPERATION FOR EASY LOOKAHEAD SUPPORT:
//
// Sometimes, we must be able to perform a lookahead, which might either
// succeed or fail. In case of a failing lookahead, the state of the
// API must be brought back to the original state, so we can try
// a different route.
//
// The way in which this is supported is by forking an API struct by
// calling the method Fork(). This will return a forked child API, with
// empty result data, but using the same read cursor position as the
// forked parent.
//
// After forking, the same interface as described for BASIC OPERATION can be
// used to fill the results. When the lookahead was successful, then
// Merge() can be called on the forked child to append the child's results
// to the parent's results, and to move the read cursor position to that
// of the child.
//
// When the lookahead was unsuccessful, then the forked child API can be
// disposed by calling Dispose() on it. This is not mandatory, since
// garbage collection will take care of it automatically.
// The parent API was never modified, so it can safely be used after disposal
// as if the lookahead never happened.
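//
// For example, a two-branch lookahead could be sketched like this (handleX
// stands for hypothetical application-specific tokenize.Handler logic that
// fills the child's results and reports success):
//
//	child := tokenAPI.Fork()
//	if handleX(child) {
//		child.Merge()   // keep the lookahead results
//	} else {
//		child.Dispose() // drop them; the parent is untouched
//	}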
//
// Opinionated note:
// Many tokenizers/parsers take a different approach to lookaheads, using
// peeks and moving the read cursor position back and forth, or pushing
// read input back onto the input stream. That often leads to code that is
// efficient, but in my opinion not very intuitive to read. It can also be
// tedious to get the cursor back to the correct position, which can lead
// to hard-to-track bugs. I much prefer this forking method, since no
// bookkeeping has to be implemented when writing a parser.
type API struct {
	reader       read.Buffer   // the buffered input reader
	pointers     stackFrame    // various values for keeping track of input, output, and cursor
	Input        Input         // access to a set of general input-related methods
	Byte         InputByteMode // access to a set of byte-based input methods
	Rune         InputRuneMode // access to a set of UTF-8 rune-based input methods
	Output       Output        // access to a set of output-related functionality
	outputTokens []Token       // storage for accepted tokens
	outputBytes  []byte        // storage for accepted bytes
}

type stackFrame struct {
	offset     int // the read offset, relative to the start of the reader buffer
	column     int // the column at which the cursor is (0-indexed, relative to the start of the stack frame)
	line       int // the line at which the cursor is (0-indexed, relative to the start of the stack frame)
	bytesStart int // the starting point in the API.bytes slice for runes produced by this stack level
	bytesEnd   int // the end point in the API.bytes slice for runes produced by this stack level
	tokenStart int // the starting point in the API.tokens slice for tokens produced by this stack level
	tokenEnd   int // the end point in the API.tokens slice for tokens produced by this stack level
}

// NewAPI initializes a new API struct, wrapped around the provided input.
// For an overview of allowed inputs, take a look at the documentation
// for parsekit.read.New().
func NewAPI(input interface{}) *API {
	tokenAPI := &API{
		reader: read.New(input),
	}
	tokenAPI.Input = Input{api: tokenAPI}
	tokenAPI.Byte = InputByteMode{api: tokenAPI}
	tokenAPI.Rune = InputRuneMode{api: tokenAPI}
	tokenAPI.Output = Output{api: tokenAPI}

	return tokenAPI
}
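
// A minimal usage sketch (see the documentation for parsekit.read.New()
// for the input types that are actually supported):
//
//	tokenAPI := NewAPI("some input text")
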

// Snapshot holds a copy of the internal pointer state of an API, as created
// by MakeSnapshot() and restored by RestoreSnapshot().
type Snapshot stackFrame

// MakeSnapshot returns a copy of the current internal pointer state, which
// can later be used to restore this state using RestoreSnapshot().
func (tokenAPI *API) MakeSnapshot() Snapshot {
	return Snapshot(tokenAPI.pointers)
}

// RestoreSnapshot brings the internal pointer state back to the state that
// was stored in the provided Snapshot.
func (tokenAPI *API) RestoreSnapshot(snap Snapshot) {
	tokenAPI.pointers = stackFrame(snap)
}
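
// Snapshot-based backtracking could be sketched like this (tryLookahead is
// a hypothetical helper that advances the input and reports success):
//
//	snap := tokenAPI.MakeSnapshot()
//	if !tryLookahead(tokenAPI) {
//		tokenAPI.RestoreSnapshot(snap)
//	}
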

// Split holds the output pointers that were active before a call to
// SplitOutput(), so MergeSplitOutput() can restore them afterwards.
type Split [2]int

// SplitOutput splits off the output at the current position. Output that is
// produced after this call is administered separately from the output that
// was already there. The returned Split value can be used to merge the
// outputs back together using MergeSplitOutput().
func (tokenAPI *API) SplitOutput() Split {
	split := Split{tokenAPI.pointers.bytesStart, tokenAPI.pointers.tokenStart}
	tokenAPI.pointers.bytesStart = tokenAPI.pointers.bytesEnd
	tokenAPI.pointers.tokenStart = tokenAPI.pointers.tokenEnd

	return split
}

// MergeSplitOutput merges the output that was split off by SplitOutput()
// back into the output that came before the split.
func (tokenAPI *API) MergeSplitOutput(split Split) {
	tokenAPI.pointers.bytesStart = split[0]
	tokenAPI.pointers.tokenStart = split[1]
}
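
// Splitting and merging output could be sketched like this (the handler
// logic that actually produces output is elided):
//
//	split := tokenAPI.SplitOutput()
//	// ... produce output that must be inspected on its own ...
//	tokenAPI.MergeSplitOutput(split) // rejoin it with the earlier output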