Generator

CypherGenerator produces valid Cypher queries from a schema. It is useful for:

Building few-shot examples for an LLM prompt (see few_shot_examples).
Synthetic load testing and benchmarking the validator.
Property-based testing: round-trip a generated query through the validator and assert validity.

Like the validator and parser, the generator lives in Rust for speed (~57 000 queries/s).

Constructing a generator

from cypher_validator import Schema, CypherGenerator

schema = Schema(
    nodes={"Person": ["name", "age"], "Movie": ["title", "year"]},
    relationships={"ACTED_IN": ("Person", "Movie", ["role"])},
)

gen = CypherGenerator(schema, seed=42)   # seed is optional

Parameter	Type	Default	Meaning
`schema`	`Schema`	required	Drives label / rel-type / property choices
`seed`	`int | None`	`None`	When set, output is fully deterministic

Precomputation

On construction the generator clones the schema's label list, rel-type list, and per-label property lists into owned Vec<String>s, so every generate() call avoids reallocating from the schema's internal HashMaps.

Supported query types

CypherGenerator.supported_types()

Thirteen templates are supported:

Type	Shape	Example
`match_return`	`MATCH (n:L) RETURN n [LIMIT …]`	`MATCH (n:Person) RETURN n LIMIT 23`
`match_where_return`	`MATCH (n:L) WHERE n.p = v RETURN n.p`	`MATCH (n:Person) WHERE n.name = $name RETURN n.name`
`create`	`CREATE (n:L {…}) RETURN n`	`CREATE (n:Movie {title: "Dune"}) RETURN n`
`merge`	`MERGE (n:L {…}) RETURN n`	`MERGE (n:Person {name: "Alice", age: 30}) RETURN n`
`aggregation`	`MATCH (n:L) RETURN count(n[.p]) AS result`	`MATCH (n:Person) RETURN count(n.name) AS result`
`match_relationship`	`[OPTIONAL] MATCH (a:L1)-[r:R]->(b:L2) RETURN a, r, b`	`MATCH (a:Person)-[r:ACTED_IN]->(b:Movie) RETURN a, r, b`
`create_relationship`	`MATCH (a:L1),(b:L2) CREATE (a)-[r:R]->(b) RETURN r`	`MATCH (a:Person),(b:Movie) CREATE (a)-[r:ACTED_IN]->(b) RETURN r`
`match_set`	`MATCH (n:L) SET n.p = v RETURN n`	`MATCH (n:Person) SET n.age = 31 RETURN n`
`match_delete`	`MATCH (n:L) DETACH DELETE n`	`MATCH (n:Person) DETACH DELETE n`
`with_chain`	`MATCH (n:L) WITH n.p AS val RETURN count(*)`	uses `WITH` to project
`distinct_return`	`MATCH (n:L) RETURN DISTINCT n[.p] [LIMIT …]`	`MATCH (n:Person) RETURN DISTINCT n.name`
`order_by`	`MATCH (n:L) RETURN n ORDER BY n.p [DESC] [LIMIT …]`	sortable test cases
`unwind`	`UNWIND [list] AS item RETURN item` (or `MATCH … UNWIND n.p AS item`)	list expansion

`generate(query_type) → str`

gen.generate("match_return")
# 'MATCH (n:Movie) RETURN n LIMIT 14'

gen.generate("create_relationship")
# 'MATCH (a:Person),(b:Movie) CREATE (a)-[r:ACTED_IN]->(b) RETURN r'

If query_type is not one of the supported names, a ValueError is raised (CypherError::GeneratorError on the Rust side).
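When query-type names come from configuration or user input, the ValueError can be caught at the call site. A minimal sketch: `generate_or_none` is a hypothetical helper, not part of the library, and it relies only on the `generate` method and the ValueError behaviour described above.

```python
def generate_or_none(gen, query_type: str):
    """Return a generated query, or None if query_type is not supported."""
    try:
        return gen.generate(query_type)
    except ValueError:
        # Raised for names that are not among the supported query types.
        return None
```

Calling `generate_or_none(gen, "match_return")` returns a query string, while an unknown name returns `None` instead of raising.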

`generate_batch(query_type, n) → list[str]`

Avoids per-call Python overhead when you need many queries of the same shape:

batch = gen.generate_batch("match_where_return", 1000)
assert len(batch) == 1000

Internally this loops in Rust without crossing the Python/Rust boundary per query.

Determinism via `seed`

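When constructed with a seed, the RNG is a SmallRng::seed_from_u64(seed). A given seed reproduces the same stream on every run of the same build, but the rand crate does not guarantee that SmallRng uses the same algorithm across crate versions or target platforms, so output may differ between builds. Within one build, two generators with the same seed agree: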

g1 = CypherGenerator(schema, seed=7)
g2 = CypherGenerator(schema, seed=7)
assert g1.generate("merge") == g2.generate("merge")

Without a seed, the RNG is initialised from the OS entropy source (SmallRng::from_os_rng) and you get fresh output on every run.

Sequence dependence

The RNG state is per-instance and advances on every call. Two generators with the same seed stay in step only while they receive the same calls in the same order; any extra or reordered call makes their later output diverge. For reproducible tests, construct a fresh generator per scenario.
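The same effect can be reproduced with any seeded PRNG. The sketch below uses Python's standard `random.Random` as a stand-in for the generator's Rust RNG; it does not call the library, and `draws` is a hypothetical helper used only for illustration.

```python
import random

def draws(rng: random.Random, n: int) -> list[int]:
    """Take n integers in [1, 100], advancing rng's state each time."""
    return [rng.randint(1, 100) for _ in range(n)]

g1 = random.Random(7)
g2 = random.Random(7)

# Same seed, same call sequence: identical output.
assert draws(g1, 3) == draws(g2, 3)

# One extra call on g2 shifts everything it produces afterwards.
baseline = draws(g1, 5)
g2.randint(1, 100)               # extra call advances g2 only
shifted = draws(g2, 4)
assert shifted == baseline[1:]   # g2 is now one step ahead of g1
```

A test that makes one additional `generate()` call before the code under test will see the same kind of shift, which is why a fresh generator per scenario is the safer pattern.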

How values are generated

Scalar values cycle through:

String literals — "Alice", "Bob", "Carol", "Neo4j", "hello"
Integers in [1, 100]
Booleans — true, false
Parameter references — $name, $id, $value, $limit

Property maps pick 1–3 properties per label (capped at the label's actual property count), and LIMIT n is appended with probability 1/3.

Relationship endpoints always honour the schema — gen_create_relationship looks up (src, tgt) from the rel-type's declared endpoints, so a generated query like (a:Person)-[r:ACTED_IN]->(b:Movie) is guaranteed to validate against the same schema.
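The rules above can be modelled in a few lines of Python. This is an illustrative sketch, not the Rust implementation: the function names are hypothetical, and the uniform choice between value kinds, the sampling of distinct properties, and the LIMIT upper bound of 100 are assumptions that the text does not specify.

```python
import random

STRINGS = ["Alice", "Bob", "Carol", "Neo4j", "hello"]
PARAMS = ["$name", "$id", "$value", "$limit"]

def random_scalar(rng: random.Random) -> str:
    """Return one Cypher scalar rendered as text."""
    kind = rng.choice(["string", "int", "bool", "param"])  # assumed uniform
    if kind == "string":
        return f'"{rng.choice(STRINGS)}"'
    if kind == "int":
        return str(rng.randint(1, 100))   # inclusive on both ends
    if kind == "bool":
        return rng.choice(["true", "false"])
    return rng.choice(PARAMS)

def random_property_map(rng: random.Random, props: list[str]) -> str:
    """Pick 1-3 distinct properties, capped at len(props) (must be >= 1)."""
    count = rng.randint(1, min(3, len(props)))
    chosen = rng.sample(props, count)
    return "{" + ", ".join(f"{p}: {random_scalar(rng)}" for p in chosen) + "}"

def maybe_limit(rng: random.Random, max_n: int = 100) -> str:
    """Return ' LIMIT n' about one time in three, else ''. max_n is assumed."""
    if rng.randrange(3) == 0:
        return f" LIMIT {rng.randint(1, max_n)}"
    return ""

rng = random.Random(0)
props = random_property_map(rng, ["name", "age"])
query = f"CREATE (n:Person {props}) RETURN n"
```

A label with two properties can never receive a three-entry map, which is the cap described above.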

Round-tripping with the validator

A standard sanity check:

from cypher_validator import Schema, CypherGenerator, CypherValidator

schema = Schema(...)
gen = CypherGenerator(schema, seed=0)
val = CypherValidator(schema)

for qt in CypherGenerator.supported_types():
    for query in gen.generate_batch(qt, 100):
        result = val.validate(query)
        assert result.is_valid, f"{qt} produced invalid query: {query}\n{result.errors}"

The validator's microbench (tests/test_perf_*.py) uses exactly this round-trip to feed its workload.
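The loop above stops at the first invalid query. When debugging a schema it can be more useful to see every failure at once. A minimal sketch, assuming only the `generate_batch`, `validate`, `is_valid`, and `errors` members used above; `collect_invalid` is a hypothetical helper, not part of the library.

```python
from typing import Iterable

def collect_invalid(gen, val, query_types: Iterable[str], per_type: int = 100):
    """Return (query_type, query, errors) for each query that fails validation."""
    failures = []
    for qt in query_types:
        for query in gen.generate_batch(qt, per_type):
            result = val.validate(query)
            if not result.is_valid:
                failures.append((qt, query, result.errors))
    return failures
```

In a test this would be called as `collect_invalid(gen, val, CypherGenerator.supported_types())`, followed by an assertion that the returned list is empty.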

Few-shot examples for LLMs

The few_shot_examples helper wraps the generator to produce labelled (description, cypher) pairs:

from cypher_validator import CypherGenerator
from cypher_validator.llm_utils import few_shot_examples

gen = CypherGenerator(schema, seed=0)
for desc, cypher in few_shot_examples(gen, n=5):
    print(f"Q: {desc}")
    print(f"A: {cypher}\n")
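The pairs can be joined into a single prompt string that ends with the open question. A minimal sketch: `build_prompt` is a hypothetical helper, and the example pair is a hand-written stand-in, not actual output of few_shot_examples.

```python
def build_prompt(examples, question: str) -> str:
    """Render (description, cypher) pairs followed by the open question."""
    blocks = [f"Q: {desc}\nA: {cypher}" for desc, cypher in examples]
    blocks.append(f"Q: {question}\nA:")   # left open for the model to complete
    return "\n\n".join(blocks)

# Hand-written stand-in pair; real pairs would come from few_shot_examples.
examples = [("Find all people", "MATCH (n:Person) RETURN n")]
prompt = build_prompt(examples, "Find all movies")
```

Because `build_prompt` accepts any iterable of pairs, the result of `few_shot_examples(gen, n=5)` can be passed to it directly.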