Parsing configuration
The included JSON files defines the parsing and database loading rules for ClinVar-related XML entities.
The configuration is intended for use in an automated XML ingestion pipeline. It provides a declarative description of how each entity should be interpreted, which ensures consistent parsing logic and reduces the need for hard-coded behaviour in the parsing code.
The configuration relies on XPath expressions to navigate and extract data from the XML structure.
The parsing instructions are directly related to the tables described in the schema configuration files, in the sense that the tables that are being populated should be defined in the schemas.
Overview
The configuration defines how ClinVar XML elements are mapped to SQLite database tables. Each table configuration specifies:
Which XML elements to extract
How to extract attribute values
Parent-child relationships between tables
SQL statement generation
Configuration Structure
Each subsequent table follows this structure:
"TableName": {
"returns_id": "table_id" | false,
"parent_id": "parent_table_id" | "entity_id" | null,
"attributes": { /* ... */ },
"children": [ /* ... */ ],
"sql": { /* ... */ }
}
Key Fields Explained
The return_id and parent_id keys can be used to define foreign key
relationships between tables.
returns_id
Specifies if this table returns a row ID for child tables to reference.
String value (e.g.,
"variation_archive_id"): Returns the auto-generated primary key after insertion. The string value is documentary — it indicates which parent_id column in child tables will receive this ID, but the actual column mapping is determined by each child’s parent_id configuration.false: Does not return an ID (leaf tables, no children need to reference it)
parent_id
Defines the foreign key relationship to the parent table.
null: No parent. Use for root or top-level tables.Column name (e.g.,
"variation_archive_id"): Standard foreign key relationship. The parent’s row ID is stored in this column."entity_id": Polymorphic relationship. The table can belong to multiple parent types. When used, two columns are populated:entity_id: The parent’s row IDentity_type: The parent’s type (e.g.,"Measure","Trait","RCVClassification")
The
entity_typevalue is specified in the parent’schildrenconfiguration.
Example — standard relationship:
"ClinVarSet": {
"returns_id": "clinvar_set_id",
"children": [
{
"xpath": "./ReferenceClinVarAssertion",
"table": "ReferenceClinVarAssertion"
}
]
}
"ReferenceClinVarAssertion": {
"parent_id": "clinvar_set_id"
}
Example — polymorphic relationship:
"Measure": {
"returns_id": "measure_id",
"children": [
{
"xpath": "./XRef", "table": "XRef", "entity_type": "Measure"
},
{
"xpath": "./Citation", "table": "Citation", "entity_type": "Measure"
}
]
}
"Trait": {
"returns_id": "trait_id",
"children": [
{
"xpath": "./XRef", "table": "XRef", "entity_type": "Trait"
},
{
"xpath": "./Citation", "table": "Citation", "entity_type": "Trait"
}
]
}
"XRef": {
"parent_id": "entity_id"
}
Here, XRef rows can belong to either Measure or Trait. The
entity_type column distinguishes which parent type each row references,
whilst entity_id holds the parent’s row ID.
Extracting Attributes
The attributes section maps database columns to XML values.
Here the key should be the column in the .db, and the parsing
instructions are provided as the key value (i.e. the RHS).
Extraction Methods
1. XML Attribute
Extract from element’s attribute:
"gene_id": {"xml_attr": "GeneID", "cast": "int"}
From: <Gene GeneID="1234">
2. XML Text Content
Extract element’s text:
"name": {"xml_path": "./Name", "xml_text": true}
From: <Name>BRCA1</Name>
3. Nested Attribute
Extract attribute from nested element:
"review_status": {
"xml_path": "./Classification/ReviewStatus",
"xml_text": true
}
From: <Classification><ReviewStatus>reviewed</ReviewStatus></Classification>
4. XML Tag Name
Use the element’s tag as the value:
"classification_type": {"xml_tag": true}
From: <GermlineClassification> → stores “GermlineClassification”
Type Casting
Add "cast" to convert values:
"int": Integer"float": Floating point"bool": Boolean (true if value is “true”, case-insensitive)"str": String (default, can be omitted)
Defining Children
The children array specifies which nested XML elements to parse
recursively. When a parent element is parsed, the parser iterates through
each child specification, finds matching XML elements using the xpath, and
recursively parses them into the specified table.
How Child Parsing Works
The parent element is inserted into the database
If
returns_idis set, the parent’s row ID is capturedFor each child specification, the parser finds all matching XML elements
Each matched element is recursively parsed, receiving the parent’s row ID as its foreign key
Basic Child
For straightforward parent-child relationships where the child table has a single parent type:
"Measure": {
"returns_id": "measure_id",
"children": [
{"xpath": "./SequenceLocation", "table": "SequenceLocation"}
]
}
"SequenceLocation": {
"parent_id": "measure_id"
}
Here, each <SequenceLocation> element found within <Measure> is parsed
and inserted with measure_id set to the parent’s row ID.
Polymorphic Child (entity_type)
For tables that can belong to multiple parent types (where
parent_id: "entity_id"), the entity_type field identifies which parent
type the child belongs to:
"Measure": {
"returns_id": "measure_id",
"children": [
{"xpath": "./XRef", "table": "XRef", "entity_type": "Measure"},
{"xpath": "./Comment", "table": "Comment", "entity_type": "Measure"}
]
}
"Trait": {
"returns_id": "trait_id",
"children": [
{"xpath": "./XRef", "table": "XRef", "entity_type": "Trait"},
{"xpath": "./Comment", "table": "Comment", "entity_type": "Trait"}
]
}
"XRef": {
"parent_id": "entity_id"
}
When an XRef is parsed as a child of Measure, two columns are populated:
entity_type = "Measure"entity_id = <parent's row ID>
This allows a single XRef table to store cross-references from multiple
parent types, with the entity_type column distinguishing which parent
each row belongs to.
Multiple Child Types
A parent can define multiple child specifications, each targeting different XML paths or tables:
"ReferenceClinVarAssertion": {
"returns_id": "reference_assertion_id",
"children": [
{"xpath": "./MeasureSet", "table": "MeasureSet"},
{"xpath": "./TraitSet", "table": "TraitSet"},
{"xpath": "./ObservedIn", "table": "ObservedIn"},
{"xpath": "./Classifications/GermlineClassification", "table": "RCVClassification"},
{"xpath": "./Classifications/SomaticClinicalImpact", "table": "RCVClassification"}
]
}
Note: multiple xpaths can map to the same table (e.g., both
GermlineClassification and SomaticClinicalImpact are stored in
RCVClassification).
SQL Configuration
"sql": {
"statement": "INSERT INTO" | "INSERT OR IGNORE INTO",
"table_name": "ActualTableName",
"columns": ["col1", "col2", "col3"]
}
statement This can be any insert statement, common examples are:
"INSERT INTO": Standard insert (fails on duplicates)"INSERT OR IGNORE INTO": Ignore duplicate key violations
table_name: The database table name
columns: Ordered list of columns (must match attributes + parent_id + entity_type if applicable)
Column Rules
The columns array defines which columns are included in the INSERT
statement. Values are extracted by column name, so order is flexible.
Each column must have a corresponding source:
The parent_id column (or entity_type and entity_id for polymorphic tables)
Keys defined in
attributes
Polymorphic tables (parent_id: “entity_id”):
"columns": ["entity_type", "entity_id", "db", "xref_id", ...]
Regular tables:
"columns": ["clinvar_set_id", "clinvar_id", "accession", ...]
Root tables (parent_id: null):
"columns": ["type", "dated"]
Common Patterns
Pattern 1: Simple Leaf Table
"OtherName": {
"returns_id": false,
"parent_id": "simple_allele_id",
"attributes": {
"name": {"xml_text": true},
"type": {"xml_attr": "Type"}
},
"children": [],
"sql": {
"statement": "INSERT OR IGNORE INTO",
"table_name": "OtherName",
"columns": ["simple_allele_id", "name", "type"]
}
}
Pattern 2: Parent with Children
"Gene": {
"returns_id": "gene_id",
"parent_id": "simple_allele_id",
"attributes": {
"gene_id": {"xml_attr": "GeneID", "cast": "int"},
"symbol": {"xml_attr": "Symbol"}
},
"children": [
{"xpath": "./Location/SequenceLocation",
"table": "SequenceLocation"}
],
"sql": {
"statement": "INSERT OR IGNORE INTO",
"table_name": "Gene",
"columns": ["simple_allele_id", "gene_id", "symbol"]
}
}
Pattern 3: Polymorphic Table
"Comment": {
"returns_id": false,
"parent_id": "entity_id",
"attributes": {
"comment_type": {"xml_attr": "Type"},
"value": {"xml_text": true}
},
"children": [],
"sql": {
"statement": "INSERT OR IGNORE INTO",
"table_name": "Comment",
"columns": ["entity_type", "entity_id", "comment_type", "value"]
}
}
Meta Section
The _meta section is used to define a progress table in the .db so parsing
can be stopped and re-started.
As part of this it defines the root element which is an (indirect) parent to
all subsequent XML tags.
The RHS JSON tags can be changed if needed, the LHS tags should not be changed.
"_meta": {
"progress_table": "VCVParseProgress",
"root_element": "ClinVarVariationRelease",
"record_element": "VariationArchive",
"accession_attr": "Accession"
}
progress_table: Database table for tracking parsing progress
root_element: XML root tag (parsed once per file)
record_element: Repeating record tag (main parsing unit)
accession_attr: Attribute name for unique record identifier
Additionally one can include an optional accession_path tag to specify the
Xpath to the record containing the accession code relative to the record
element.
Common Modifications
Adding a New Attribute
Add to
attributes:"new_field": {"xml_attr": "NewAttribute"}
Add to
sql.columns:"columns": ["existing_col", "new_field"]
Adding a New Child Table
Define the child table configuration
Add to parent’s
children:{"xpath": "./NewElement", "table": "NewTable"}
Changing Extraction Method
To extract from nested element instead of attribute:
// Before
"field": {"xml_attr": "Value"}
// After
"field": {"xml_path": "./ValueElement", "xml_text": true}
XPath Tips
Relative paths: Start with
./for immediate childrenDescendant search: Use
.//to search all descendantsMultiple levels:
./Parent/Child/Grandchild
Example:
// Immediate child
"xpath": "./Gene"
// Any descendant named Gene
"xpath": ".//Gene"
// Specific nested path
"xpath": "./ClassifiedRecord/SimpleAllele"
Validation Checklist
When modifying the configuration:
[ ]
returns_idmatches what children reference asparent_id[ ]
sql.columnsincludes allattributeskeys[ ]
sql.columnsincludesparent_idcolumn (if not null)[ ] Polymorphic tables have
entity_typeandentity_idin columns[ ] XPath expressions match actual XML structure
[ ] Type casts are appropriate for data types
[ ] Child
entity_typevalues match parent table names