// Package parsekit provides tooling for building parsers using recursive
// descent and parser/combinator methodology.
//
// The two main components for parsing are the subpackages 'tokenize' and 'parse'.
//
// TOKENIZE
//
// The tokenize package's focus is to take input data and to produce
// tokens from that input: the bits and pieces that can be extracted
// from the input data and that can be recognized by the parser.
//
// Traditionally, a tokenizer would produce generic tokens (like 'numbers',
// 'plus sign', 'letters') without caring at all about the actual structure
// or semantics of the input. That would be the task of the parser.
//
// I said 'traditionally', because the tokenize package provides a
// parser/combinator-style parser, which makes it easy to construct
// complex tokenizers that are parsers in their own right.
// You can even write a tokenizer and use it in a stand-alone manner
// (see examples - Dutch Postcode).
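//
// As an impression of that style, here is a minimal sketch of a
// tokenizer for a Dutch postcode (a digit 1-9, three more digits, an
// optional space and two letters, as in "1234 AB"). The shorthand
// variables and combinator names below (tokenize.C, tokenize.A, Seq,
// Rep, Opt, Digit and so on) are assumptions for illustration only;
// see the tokenize package documentation for the actual API:
//
//    // Shorthands for the combinator (C) and atom (A) sets (assumed names).
//    c, a := tokenize.C, tokenize.A
//
//    postcode := c.Seq(              // assumed: matches items in sequence
//        c.RuneRange('1', '9'),      // assumed: a single rune in a range
//        c.Rep(3, a.Digit),          // assumed: exactly three digits
//        c.Opt(a.Space),             // assumed: an optional space
//        c.Rep(2, a.ASCIIUpper),     // assumed: two uppercase letters
//    )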
//
// PARSE
//
// The parse package's focus is to interpret the tokens as provided
// by the tokenizer. The intended style for the parser code is a left-to-right
// recursive descent parser state machine, constructed from recursive
// function calls.
//
// This might sound intimidating if you're not familiar with the terminology,
// but don't worry about that. It simply means that you implement your parser
// by writing functions that know how to handle various parts of the input,
// and these functions invoke each other based on the input tokens that are
// found, going from left to right over the input
// (see examples - Hello Many State Parser).
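//
// A minimal sketch of that style, for input like "hello, <name>!".
// The parse.API method names used here (Accept, Handle, Expected,
// Stop) are assumptions for illustration; see the parse package
// documentation for the actual API:
//
//    // Each function is one state of the parser. States hand control
//    // to each other based on the tokens found, from left to right.
//    func greetingHandler(p *parse.API) {
//        if !p.Accept(helloToken) { // helloToken: an assumed tokenize handler
//            p.Expected("the greeting 'hello'")
//            return
//        }
//        p.Handle(nameHandler) // hand off to the next state
//    }
//
//    func nameHandler(p *parse.API) {
//        if !p.Accept(nameToken) { // nameToken: an assumed tokenize handler
//            p.Expected("a name")
//            return
//        }
//        p.Stop() // the end state: parsing completed successfully
//    }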
//
// BALANCE BETWEEN THE TWO
//
// When writing your own parser using parsekit, you will have to find a
// good balance between the responsibilities of the tokenizer and the parser.
// The tokenizer could provide anything from a stream of individual bytes
// (where the parser will have to do all the work) to a fully parsed
// and tokenized document for the parser to interpret.
//
// In general, recognizing input data belongs in a tokenizer, while interpreting
// input data belongs in a parser. You could, for example, perfectly well write
// parser code that takes individual digit tokens and checks whether those make
// up a phone number, but it is a lot easier to have that handled by a tokenizer.
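//
// In sketch form, instead of parser code that accepts ten separate
// digit tokens one by one, the tokenizer could deliver the complete
// phone number as a single token (combinator names are again
// assumptions for illustration):
//
//    phoneNumber := c.Rep(10, a.Digit) // assumed: exactly ten digits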
//
// When all you need is to recognize some data, maybe normalize it and extract
// some bits from it, then you might not even require a parser. A stand-alone
// tokenizer can do all that.
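//
// In sketch form, such stand-alone use could look like this (the
// tokenize.New function and the result handling are assumptions for
// illustration):
//
//    tokenizer := tokenize.New(postcode) // wrap the handler from the sketch above
//    result, err := tokenizer("1234 ab") // run it against some input
//    if err == nil {
//        fmt.Println(result) // e.g. the normalized form "1234 AB"
//    }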
package parsekit
|