Parsers¶
The hyperbase.parsers module provides the interface for parsing natural language into Semantic Hypergraphs. Hyperbase uses a plugin architecture: the core package defines an abstract Parser class and a discovery mechanism, while concrete parser implementations are installed as separate Python packages.
Available parsers¶
| Package | Plugin name | Description |
|---|---|---|
| hyperbase-parser-ab | alphabeta | AlphaBeta parser, based on spaCy. Open source (MIT). |
| hyperbase-parser-gen | generative | Multilingual generative parser, based on a fine-tuned transformer model. Proprietary. |
AlphaBeta is the classical parser for Semantic Hypergraphs. It supports any language for which a spaCy model is available (see the installation guide for language-specific setup).
Generative is the modern parser that produces high-quality parses across many languages. It requires a GPU for acceptable speed. Contact us if you are a researcher and wish to have early access.
Getting a parser¶
Parsers are obtained by name with get_parser(), for example get_parser("alphabeta", lang="en").
The keyword arguments are forwarded to the parser constructor. Each parser plugin defines its own parameters -- for example, alphabeta takes a lang code, while generative accepts model_path, device, max_length, and others. Run hyperbase repl --parser <name> --help (or hyperbase read --parser <name> --help) to see the full set of CLI flags injected by the active plugin.
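As a rough illustration of this forwarding pattern, the sketch below looks up a class by name in a small in-memory registry and passes the keyword arguments through as its parameters. `ToyParser`, `REGISTRY` and `make_parser` are hypothetical stand-ins invented for this example; they are not part of hyperbase, and the real get_parser() may work differently.

```python
class ToyParser:
    """Stand-in for a parser plugin (hypothetical, not a hyperbase class)."""

    def __init__(self, params=None):
        # Keep a private copy of whatever parameters were forwarded.
        self.params = dict(params or {})


# Hypothetical registry mapping plugin names to parser classes.
REGISTRY = {"toy": ToyParser}


def make_parser(name, **kwargs):
    """Look up a parser class by name and forward kwargs as its parameters."""
    try:
        parser_cls = REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown parser: {name!r}") from None
    return parser_cls(params=kwargs)


parser = make_parser("toy", lang="en")
print(parser.params)  # {'lang': 'en'}
```

Because the factory does not interpret the keyword arguments itself, each plugin remains free to define its own parameter set.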
To see which parsers are installed:
from hyperbase.parsers import list_parsers

for name, entry_point in list_parsers().items():
    print(f"{name}: {entry_point.value}")
Or, from the command line, run hyperbase parsers.
Parsing text¶
Single sentence¶
The parse() method takes a text string, splits it into sentences, and yields one ParseResult per sentence.
For a single sentence, you can also call parse_sentence() directly, which returns a list of ParseResult objects.
Longer texts¶
For texts with many sentences, parse_text() handles sentence splitting and batching automatically:
results = parser.parse_text(
    "The sky is blue. Birds are singing.",
    batch_size=8,
    progress=True,  # shows a tqdm progress bar
)
for result in results:
    print(result.text, "->", result.edge)
Under the hood, parse_text() splits the text into sentences, groups them into batches, and calls parse_batch() on each batch. Parser plugins can override parse_batch() to exploit hardware parallelism (e.g. batched GPU inference).
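The split, batch and parse flow described above can be sketched in a few lines. This is a toy reconstruction under stated assumptions, not hyperbase's implementation: `batched`, `split_sentences` and the dictionary-returning `parse_batch` are hypothetical helpers written for this example, and the sentence splitter is deliberately naive.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def split_sentences(text):
    """Naive splitter on '.', for illustration only."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]


def parse_batch(sentences):
    """Toy stand-in: a real plugin could run batched model inference here."""
    return [{"text": s, "edge": None} for s in sentences]


def parse_text(text, batch_size=8):
    """Split into sentences, group them into batches, parse each batch."""
    for batch in batched(split_sentences(text), batch_size):
        yield from parse_batch(batch)


results = list(parse_text("The sky is blue. Birds are singing. It is warm.", batch_size=2))
print([r["text"] for r in results])
# ['The sky is blue.', 'Birds are singing.', 'It is warm.']
```

With batch_size=2 the three sentences are processed as one batch of two followed by one batch of one, while the caller still sees a flat stream of results.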
Reading and parsing sources¶
The Parser class integrates with the readers module to parse text from files, URLs and Wikipedia articles in a single call:
# Iterate over parse results block by block
for results in parser.read_source("article.txt"):
    for result in results:
        print(result.edge)

# Or write everything to a JSONL file
parser.read_source_to_jsonl("article.txt", "output.jsonl", progress=True)
Both methods accept an optional reader argument to force a specific reader instead of relying on auto-detection. See the readers documentation for details.
ParseResult¶
Every parse operation returns ParseResult objects. This is a dataclass with the following fields:
| Field | Type | Description |
|---|---|---|
| edge | Hyperedge | The parsed Semantic Hypergraph edge. |
| text | str | The original sentence text. |
| tokens | list[str] | The tokens extracted from the sentence. |
| tok_pos | Hyperedge | A hyperedge mapping token positions to atoms. |
| failed | bool | True if the parse failed. Defaults to False. |
| errors | list[str] | Error messages, if any. |
| extra | dict | Parser-specific extra data (e.g. raw model output, candidates). |
| source | dict | Metadata about the source of the text. |
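One common use of these fields is separating successful parses from failed ones. The sketch below uses a stand-in dataclass, `ResultSketch`, that mirrors the documented field names but stores edges as plain placeholder strings so that it runs without hyperbase installed. The defaults for errors, extra and source are assumptions; only the default of failed is documented above.

```python
from dataclasses import dataclass, field


@dataclass
class ResultSketch:
    """Stand-in mirroring the documented fields (not the real ParseResult)."""
    edge: str      # placeholder string instead of a Hyperedge
    text: str
    tokens: list
    tok_pos: str   # placeholder string instead of a Hyperedge
    failed: bool = False
    errors: list = field(default_factory=list)   # assumed default
    extra: dict = field(default_factory=dict)    # assumed default
    source: dict = field(default_factory=dict)   # assumed default


def split_by_outcome(results):
    """Separate successful parses from failed ones using the failed flag."""
    ok = [r for r in results if not r.failed]
    bad = [r for r in results if r.failed]
    return ok, bad


results = [
    ResultSketch("edge-placeholder", "The sky is blue.",
                 ["The", "sky", "is", "blue", "."], "tok-pos-placeholder"),
    ResultSketch("", "???", ["???"], "", failed=True, errors=["no parse"]),
]
ok, bad = split_by_outcome(results)
print(len(ok), len(bad))  # 1 1
for r in bad:
    print(r.text, r.errors)
```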
Serialization¶
ParseResult can be serialized to and from JSON:
# To JSON string
json_str = result.to_json()
# From JSON string
result = ParseResult.from_json(json_str)
# To/from dict
d = result.to_dict()
result = ParseResult.from_dict(d)
This is what read_source_to_jsonl() uses internally -- each line in the output file is one ParseResult serialized as JSON.
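Since each line is an independent JSON document, such a file can be processed line by line with the standard library alone. The sketch below writes two sample records and reads them back; `iter_jsonl` is a hypothetical helper, and the keys text and failed are assumed to mirror the ParseResult field names in the serialized form.

```python
import json
import os
import tempfile


def iter_jsonl(path):
    """Yield one decoded JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write two sample records so the example is self-contained.
records = [
    {"text": "The sky is blue.", "failed": False},
    {"text": "???", "failed": True},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    path = tmp.name

loaded = list(iter_jsonl(path))
os.remove(path)

# Keep only the records whose parse did not fail.
print([r["text"] for r in loaded if not r["failed"]])  # ['The sky is blue.']
```

With hyperbase installed, each decoded dictionary could presumably be passed to ParseResult.from_dict() to recover a full result object.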
Quality checking¶
Badness/correctness checking lives in the parser plugin that needs it. The generative parser ships hyperbase_parser_gen.correctness.badness_check for combined structural + token-matching validation; see that package's docs for usage.
CLI¶
Listing parsers¶
The hyperbase parsers command shows all installed parser plugins and their entry point values.
Interactive REPL¶
The REPL, started with hyperbase repl, lets you parse sentences interactively.
Inside the REPL, type a sentence to parse it. Use /help to see available commands, /settings to view current configuration, and /set to change settings on the fly (e.g. /set parser generative). The REPL caches parser instances, so switching between parsers is fast after the first load.
Reading and parsing files¶
# Parse a file to JSONL
hyperbase read article.txt -o output.jsonl --parser alphabeta --lang en
# Parse a Wikipedia article
hyperbase read https://en.wikipedia.org/wiki/Hypergraph -o output.jsonl
See the readers documentation for the full set of hyperbase read options.
Custom parsers¶
To create a custom parser, subclass Parser and implement:
- __init__(params) -- constructor accepting a dictionary of parser parameters.
- get_sentences(text) -- split a text string into a list of sentences.
- parse_sentence(sentence) -- parse a single sentence and return a list of ParseResult objects.
- accepted_params() (classmethod) -- return a dict describing the parameters the parser accepts.
Optionally, override parse_batch(sentences) if your parser can process multiple sentences more efficiently in a single call.
from hyperbase.parsers import Parser, ParseResult
from hyperbase.hyperedge import hedge


class MyParser(Parser):
    @classmethod
    def accepted_params(cls):
        return {
            "lang": {
                "type": str, "default": None,
                "description": "Language code.", "required": True,
            },
        }

    def __init__(self, params=None):
        super().__init__(params)
        self.lang = self.params["lang"]

    def get_sentences(self, text):
        # simple sentence splitting
        return [s.strip() for s in text.split('.') if s.strip()]

    def parse_sentence(self, sentence):
        # toy parse: wrap the first word of the sentence in a fixed edge
        edge = hedge(f"(says/P someone/C {sentence.split()[0]}/C)")
        return [ParseResult(
            edge=edge,
            text=sentence,
            tokens=sentence.split(),
            tok_pos=edge,
        )]
Registering as a plugin¶
To make a parser discoverable by get_parser(), register it as an entry point in your package's pyproject.toml.
After installation, get_parser("myparser") will instantiate your parser, and hyperbase parsers will list it alongside the built-in ones.
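Entry-point based discovery of this kind can be inspected with the standard library. The sketch below lists whatever is registered under a given group; the group name "hyperbase.parsers" and the pyproject.toml fragment in the comment are assumptions made for illustration, so check hyperbase's own packaging metadata for the group it actually reads. `discover` is a hypothetical helper, not a hyperbase function.

```python
from importlib.metadata import entry_points

# Hypothetical group name, used only for this sketch.
GROUP = "hyperbase.parsers"

# A matching (equally hypothetical) pyproject.toml fragment would be:
#   [project.entry-points."hyperbase.parsers"]
#   myparser = "mypackage.parser:MyParser"


def discover(group):
    """Return {name: entry_point} for every entry point in the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):   # Python 3.10 and later
        selected = eps.select(group=group)
    else:                        # Python 3.8 / 3.9 return a plain dict
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


for name, ep in discover(GROUP).items():
    print(f"{name}: {ep.value}")
```

If no package has registered anything under the group, the loop simply prints nothing, which is a quick way to check whether a plugin was installed correctly.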