MONDO Prioritizer
The MONDO prioritizer builds a curation queue from MONDO disease rows rather than from ad hoc checklist issues. It scores candidate diseases against local dismech coverage, then applies a small set of explicit specificity heuristics to suggest whether a term should be curated as a root, lumped into a parent, or dropped.
The goal is practical queue construction:
- reward missing diseases that already have useful MONDO metadata
- reward external evidence signals when they are available in the input rows
- penalize terms that are already curated, obsolete, or likely bad roots
- make every score component inspectable and easy to retune
CLI
uv run dismech-mondo-prioritize \
  examples/mondo_prioritizer_candidates.tsv \
  --format table
Example with a custom config and TSV output:
uv run dismech-mondo-prioritize \
  examples/mondo_prioritizer_candidates.tsv \
  --config examples/mondo_prioritizer_config.yaml \
  --format tsv \
  --output /tmp/mondo_priority.tsv
Generate the static website dashboard from the same scoring logic:
just gen-priority-dashboard
This writes dashboard/priority.html and dashboard/priority.json, and patches
dashboard/index.html with a link when that dashboard index already exists.
For a local-only run across all MONDO disease descendants currently missing from
kb/disorders/, use:
just gen-priority-dashboard-all-mondo
That flow first exports candidate rows from the local MONDO sqlite database at
~/.data/oaklib/mondo.db, then writes the resulting dashboard under
tmp/priority-dashboard-all-mondo/. The tmp/ tree is gitignored so these
large artifacts stay out of GitHub by default.
Expected Input
The prioritizer accepts tsv, csv, json, or jsonl candidate rows. The
only required fields are:
- mondo_id
- label
Optional fields become weighted signals when present:
- definition
- synonyms
- parents
- xrefs
- child_count
- clingen_definitive_count
- clingen_strong_count
- subset_match_count
- orphanet_match_count
- is_obsolete
Field aliases such as id, name, disease_mondo, and term_label are also
accepted. Multi-valued columns can use ; or |.
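The alias and multi-value handling described above can be sketched as follows. This is a minimal illustration, not the prioritizer's own loader: the alias-to-field mapping, the set of multi-valued columns, and the `normalize_row` helper are all assumptions made for the example.

```python
import re

# Assumed mapping from accepted aliases to canonical field names.
FIELD_ALIASES = {
    "id": "mondo_id",
    "disease_mondo": "mondo_id",
    "name": "label",
    "term_label": "label",
}

# Assumed set of columns that may hold several values in one cell.
MULTI_VALUED = {"synonyms", "parents", "xrefs"}


def normalize_row(raw: dict) -> dict:
    """Rename aliased keys and split multi-valued cells on ';' or '|'."""
    row = {}
    for key, value in raw.items():
        canonical = FIELD_ALIASES.get(key, key)
        if canonical in MULTI_VALUED and isinstance(value, str):
            parts = [part.strip() for part in re.split(r"[;|]", value)]
            row[canonical] = [part for part in parts if part]
        else:
            row[canonical] = value
    # mondo_id and label are the only required fields.
    missing = {"mondo_id", "label"} - set(row)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return row
```

Rows that lack both a canonical field and any of its aliases fail early, which keeps malformed exports from silently entering the queue.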
Scoring Dimensions
Default weights live in conf/mondo_prioritizer.yaml.
By default the scorer uses:
- missing_from_dismech: positive base reward for uncurated roots
- has_definition: reward for a candidate that already has a MONDO definition
- synonym_count, xref_count: small metadata completeness rewards
- clingen_definitive_count, clingen_strong_count: optional evidence boosts when those counts are supplied in the MONDO candidate export
- subset_match_count, orphanet_match_count: optional prioritization boosts when the upstream pipeline already tagged the disease as belonging to an important slice
- already_curated_penalty, obsolete_penalty: strong penalties for terms we should not queue
Count-based features are capped so a row with very large synonym or xref counts does not dominate the queue.
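A minimal sketch of how capped, weighted components might combine into one inspectable score. The weight and cap values here are invented for illustration (the real defaults live in conf/mondo_prioritizer.yaml), `score_row` is a hypothetical helper, and the row is assumed to already hold multi-valued fields as lists. Only a subset of the signals is shown; the ClinGen and subset counts would follow the same capped pattern.

```python
# Hypothetical weights and caps; not the shipped defaults.
WEIGHTS = {
    "missing_from_dismech": 10.0,
    "has_definition": 2.0,
    "synonym_count": 0.5,
    "xref_count": 0.5,
    "already_curated_penalty": -50.0,
    "obsolete_penalty": -100.0,
}
COUNT_CAPS = {"synonym_count": 5, "xref_count": 5}


def score_row(row: dict, curated_ids: set) -> tuple:
    """Return (total, components) so every contribution stays inspectable."""
    components = {}
    if row["mondo_id"] in curated_ids:
        components["already_curated_penalty"] = WEIGHTS["already_curated_penalty"]
    else:
        components["missing_from_dismech"] = WEIGHTS["missing_from_dismech"]
    if row.get("definition"):
        components["has_definition"] = WEIGHTS["has_definition"]
    # Cap each count before weighting so one metadata-rich row cannot dominate.
    for feature, source in (("synonym_count", "synonyms"), ("xref_count", "xrefs")):
        count = min(len(row.get(source) or []), COUNT_CAPS[feature])
        components[feature] = WEIGHTS[feature] * count
    if row.get("is_obsolete"):
        components["obsolete_penalty"] = WEIGHTS["obsolete_penalty"]
    return sum(components.values()), components
```

Returning the per-component dictionary alongside the total is what makes retuning practical: a reviewer can see which weight moved a row up or down the queue.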
Specificity Heuristics
The prioritizer also emits a specificity_bucket and recommended_action.
Default buckets are:
- already_curated
- obsolete
- grouping_term
- subtype_series
- broad_parent
- over_specific_leaf
- root_candidate
The default heuristic rules are intentionally conservative:
- grouping_term: regex match on high-confidence phrases such as "susceptibility", "predisposition", or "obsolete"
- subtype_series: label matches a configurable subtype suffix pattern such as "long QT syndrome 1" or "Noonan syndrome 1", and the same stem appears multiple times
- broad_parent: candidate has many children and is not already classified as a subtype series or grouping term
- over_specific_leaf: leaf term with a long conjunction-heavy label, which is a useful review flag for phenotype-enumerating leaves
This does not try to solve lump-vs-split perfectly. It surfaces explicit review
recommendations such as CURATE_ROOT_WITH_SUBTYPES, LUMP_INTO_PARENT, and
REVIEW_AGAINST_PARENT so the queue is defensible and inspectable.
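The bucket rules can be sketched roughly as below. Everything concrete here is an assumption for illustration: the regexes, the thresholds, the check order (taken from the bucket list order), and the mapping of over_specific_leaf to REVIEW_AGAINST_PARENT. The actual patterns and thresholds are configurable and may differ.

```python
import re
from collections import Counter

# Assumed patterns and thresholds; the real ones come from configuration.
GROUPING_RE = re.compile(r"\b(susceptibility|predisposition|obsolete)\b", re.IGNORECASE)
SUBTYPE_RE = re.compile(r"^(?P<stem>.+?)\s+(?:type\s+)?\d+[A-Za-z]?$")
MIN_CHILDREN_FOR_BROAD = 5
MIN_WORDS_FOR_LONG_LABEL = 8
CONNECTORS = {"and", "or", "with"}

# Bucket-to-action pairs; only some buckets have an action named in this document.
ACTIONS = {
    "already_curated": "ALREADY_CURATED",
    "broad_parent": "CURATE_ROOT_WITH_SUBTYPES",
    "subtype_series": "LUMP_INTO_PARENT",
    "over_specific_leaf": "REVIEW_AGAINST_PARENT",  # assumed pairing
}


def stem_counts(labels) -> Counter:
    """Count how often each subtype stem occurs across all candidate labels."""
    counts = Counter()
    for label in labels:
        match = SUBTYPE_RE.match(label)
        if match:
            counts[match.group("stem").lower()] += 1
    return counts


def classify(row: dict, curated_ids: set, stems: Counter) -> str:
    """Assign a specificity bucket, checking rules in bucket-list order."""
    label = row["label"]
    if row["mondo_id"] in curated_ids:
        return "already_curated"
    if row.get("is_obsolete"):
        return "obsolete"
    if GROUPING_RE.search(label):
        return "grouping_term"
    match = SUBTYPE_RE.match(label)
    if match and stems[match.group("stem").lower()] >= 2:
        return "subtype_series"
    child_count = int(row.get("child_count") or 0)
    if child_count >= MIN_CHILDREN_FOR_BROAD:
        return "broad_parent"
    words = label.lower().replace(",", " ").split()
    connectors = sum(word in CONNECTORS for word in words)
    if child_count == 0 and len(words) >= MIN_WORDS_FOR_LONG_LABEL and connectors >= 2:
        return "over_specific_leaf"
    return "root_candidate"
```

Checking grouping_term before subtype_series matters in this sketch: a label such as "autism, susceptibility to, X-linked 3" ends in a number but should still be treated as a grouping term.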
Example Interpretation
With the bundled example candidates:
- "long QT syndrome" scores as a missing broad parent and is recommended as CURATE_ROOT_WITH_SUBTYPES
- "long QT syndrome 1" and "long QT syndrome 3" score as LUMP_INTO_PARENT
- "autism, susceptibility to, X-linked 3" is flagged as a grouping_term
- "Noonan syndrome" is heavily penalized as ALREADY_CURATED
- "obsolete unclassified cardiomyopathy" is dropped as obsolete
Why This Helps
Issue-driven checklists are useful prompts, but they mix roots, subtype series, grouping terms, and questionable leaves. A MONDO-first prioritizer gives us a reusable way to:
- start from MONDO terms and metadata
- layer in optional external evidence counts
- compare against current dismech coverage
- generate a tuneable queue with transparent reasons
That makes it easier to review prioritization choices and to adjust weights as our curation objectives change.