Parser

The parse_query function parses a Cypher query syntactically and returns a QueryInfo summary; no schema is required. Use it to check whether a query parses and to find out which labels, rel-types, and property keys it references.

The parser uses a Pest PEG grammar (src/grammar/cypher.pest, 312 lines) and an AST builder (src/parser/builder.rs, 1040 lines). All of it runs in Rust at ~57 000 queries/s on a single core.
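The throughput figure will vary with hardware and query mix. A minimal sketch for measuring it locally follows, assuming only that the parser is a callable that takes a query string. The helper name measure_throughput is hypothetical and not part of cypher_validator.

```python
import time
from typing import Callable, Sequence


def measure_throughput(
    parse: Callable[[str], object],
    queries: Sequence[str],
    repeats: int = 5,
) -> float:
    """Return how many queries per second `parse` handles on this machine.

    `parse` is any callable that takes a query string, for example
    `cypher_validator.parse_query`. This helper is illustrative only.
    """
    if not queries or repeats < 1:
        raise ValueError("need at least one query and one repeat")

    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            parse(query)
    elapsed = time.perf_counter() - start

    total = len(queries) * repeats
    # Guard against a zero reading from a very coarse clock.
    return total / elapsed if elapsed > 0 else float("inf")
```

With the library installed, a call such as measure_throughput(parse_query, sample_queries) gives a local number; it is only comparable to the figure above if the sample resembles the queries that figure was measured on.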

`parse_query(query) → QueryInfo`

from cypher_validator import parse_query

info = parse_query(
    "MATCH (p:Person)-[:WORKS_FOR]->(c:Company) "
    "WHERE p.age > 30 "
    "RETURN p.name, c.name"
)

info.is_valid        # True
info.errors          # []
info.labels_used     # ['Company', 'Person']     — sorted
info.rel_types_used  # ['WORKS_FOR']
info.properties_used # ['age', 'name']

QueryInfo is truthy when is_valid is True:

if parse_query(query):
    db.execute(query)

Field	Type	Notes
`is_valid`	`bool`	True iff the parser produced an AST without errors.
`errors`	`list[str]`	One string per parse error, with line/col when known.
`labels_used`	`list[str]`	Sorted, deduplicated; collected from every pattern (incl. subqueries).
`rel_types_used`	`list[str]`	Sorted, deduplicated.
`properties_used`	`list[str]`	Property keys referenced in patterns, WHERE, RETURN, etc.
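The truthiness rule and the errors field combine naturally into a batch filter. The sketch below is one possible shape, not a library function: partition_queries is a hypothetical helper, and its parse argument stands in for parse_query so the logic does not depend on anything beyond the fields listed in the table.

```python
from typing import Callable, Iterable, List, Tuple


def partition_queries(
    parse: Callable[[str], object],
    queries: Iterable[str],
) -> Tuple[List[str], List[Tuple[str, List[str]]]]:
    """Split queries into those that parse and those that do not.

    Relies only on two documented behaviours of the returned object:
    it is truthy when `is_valid` is True, and `errors` is a list of strings.
    """
    valid: List[str] = []
    rejected: List[Tuple[str, List[str]]] = []
    for query in queries:
        info = parse(query)
        if info:
            valid.append(query)
        else:
            # Keep the error strings next to the query for later inspection.
            rejected.append((query, list(info.errors)))
    return valid, rejected
```

Passing the parser in as an argument also makes the helper easy to exercise in unit tests with a stub, without loading the Rust extension.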

When to use `parse_query` vs `CypherValidator`

Need	Tool
Schema-aware validation, "did you mean?", auto-fix	`CypherValidator`
Fast filter — "is this string even valid Cypher?"	`parse_query`
Provenance — "which labels does this query touch?"	`parse_query`
LLM repair loop with structured diagnostics	`CypherValidator`

parse_query is the building block behind the LLM pipeline's provenance generation: to construct (:Chunk)<-[:MENTIONED_IN]-(domain_node) edges, the pipeline parses the LLM-generated Cypher and pulls out the domain labels involved.
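The pipeline's actual provenance code is not shown here, so the following is only a sketch of the idea under stated assumptions. It takes the labels_used list of a QueryInfo, drops bookkeeping labels such as Chunk, and renders an edge-creating statement per domain label. The helper names, the id property, and the $node_id and $chunk_id parameters are all hypothetical.

```python
from typing import Iterable, List


def domain_labels(
    labels_used: Iterable[str],
    infrastructure: Iterable[str] = ("Chunk",),
) -> List[str]:
    """Return the sorted, deduplicated labels that are not bookkeeping labels."""
    skip = set(infrastructure)
    return sorted({label for label in labels_used if label not in skip})


def mentioned_in_statement(label: str) -> str:
    """Render a Cypher statement linking one domain node to one chunk.

    Assumption: nodes and chunks are both addressed by an `id` property,
    supplied through the hypothetical $node_id and $chunk_id parameters.
    """
    # Labels cannot be passed as query parameters, so only plain
    # identifiers are accepted for interpolation here.
    if not label.isidentifier():
        raise ValueError(f"unsupported label: {label!r}")
    return (
        f"MATCH (n:{label} {{id: $node_id}}), (c:Chunk {{id: $chunk_id}}) "
        "MERGE (n)-[:MENTIONED_IN]->(c)"
    )
```

For example, domain_labels(info.labels_used) on a query touching Person, Company, and Chunk would leave Company and Person as the labels to link.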

Labels collected from every nested construct

A recent bugfix made the parser walk the entire AST when collecting labels and rel-types. Before the fix, the constructs marked "fixed" below did not surface their labels:

Construct	Now collected?
Top-level `MATCH (p:Person)`	yes (always was)
`CREATE` / `MERGE` patterns	yes (always was)
`EXISTS { (p)-[:LIVES_IN]->(:City) }` subqueries	yes (fixed)
`[(p)-[:KNOWS]->(f) WHERE f.age > 30 | f.name]` pattern comprehensions	yes (fixed)
`shortestPath((a)-[*..5]-(b))`	yes (fixed)
`reduce(s = 0, n IN nodes(path) | s + n.value)` inner sources	yes (fixed)
`CALL { ... }` subqueries (regular form)	yes (fixed)
`COUNT { MATCH (n:City) RETURN n }` subqueries	yes (fixed)
`COLLECT { MATCH (n:Tag) RETURN n.name }` subqueries	yes (fixed)

info = parse_query(
    "MATCH (p:Person) "
    "WHERE EXISTS { (p)-[:LIVES_IN]->(:City) } "
    "  AND [(p)-[:KNOWS]->(f:Friend) | f.name] <> [] "
    "RETURN p.name, "
    "       COUNT { MATCH (m:Movie)<-[:ACTED_IN]-(p) } AS films"
)

info.labels_used      # ['City', 'Friend', 'Movie', 'Person']
info.rel_types_used   # ['ACTED_IN', 'KNOWS', 'LIVES_IN']
info.properties_used  # ['name']

This matters because the LLM pipeline's provenance step depends on complete collection. Without it, MERGE-generated edges would attach only to nodes from top-level patterns and would miss entities mentioned inside EXISTS subqueries or pattern comprehensions.

Error reporting

When the query is syntactically invalid, is_valid is False, errors contains one or more formatted strings (each typically including a position in the form line N:M), and the label, rel-type, and property lists are empty:

info = parse_query("MATCH (n:Person RETURN n")
info.is_valid   # False
info.errors     # ["Parse error at line 1:17 — expected ')' ..."]

For richer diagnostics with positions and "did you mean?" hints, use CypherValidator, which reports parse errors as E101 ParseError diagnostics with position_line and position_col set.
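When only parse_query is available, the position can still be recovered from the error string. The sketch below assumes the line N:M pattern seen in the example message and assumes columns are 1-based; both helper names are hypothetical, and messages without a position simply yield None.

```python
import re
from typing import Optional, Tuple

# Matches the "line N:M" fragment seen in the example error message.
_POSITION = re.compile(r"line (\d+):(\d+)")


def error_position(message: str) -> Optional[Tuple[int, int]]:
    """Extract (line, column) from an error string, or None if absent."""
    match = _POSITION.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def point_at(query: str, message: str) -> str:
    """Return the offending query line with a caret under the error column.

    Falls back to the unchanged query when no usable position is found.
    Assumption: line and column numbers are 1-based.
    """
    position = error_position(message)
    if position is None:
        return query
    line_no, col_no = position
    lines = query.splitlines()
    if not 1 <= line_no <= len(lines) or col_no < 1:
        return query
    return lines[line_no - 1] + "\n" + " " * (col_no - 1) + "^"
```

Applied to the example above, the caret would land under the R of RETURN, the point where the parser expected the closing parenthesis.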

What the parser actually understands

The PEG grammar covers the practical subset of Cypher used in production:

MATCH / OPTIONAL MATCH / WHERE
CREATE / MERGE (with ON CREATE SET / ON MATCH SET)
SET (assignment, property update, label addition)
REMOVE (property removal, label removal)
DELETE / DETACH DELETE
WITH (incl. WITH DISTINCT, alias, WHERE, ORDER BY, SKIP, LIMIT)
RETURN (incl. RETURN DISTINCT, alias, ORDER BY, SKIP, LIMIT)
UNWIND list AS item
UNION / UNION ALL
CALL { ... } subqueries (read or write)
CALL proc.name(...) standalone procedure calls
FOREACH (x IN list | actions)
Expressions: arithmetic, comparison, boolean, list, map, list comprehension, pattern comprehension, EXISTS, COUNT { ... }, COLLECT { ... }, reduce, shortestPath, allShortestPaths, function calls, parameters ($x), literals.
Patterns: node patterns with multi-labels and inline maps, relationship patterns with variable-length (*, *1..5), direction (->, <-, -), and inline rel maps.

If you hit a Cypher construct the parser does not understand, please open an issue with the failing query — the grammar is extended in src/grammar/cypher.pest and the AST builder in src/parser/builder.rs.