Parsing configuration

The included JSON files defines the parsing and database loading rules for ClinVar-related XML entities.

The configuration is intended for use in an automated XML ingestion pipeline. It provides a declarative description of how each entity should be interpreted, which ensures consistent parsing logic and reduces the need for hard-coded behaviour in the parsing code.

The configuration relies on XPath expressions to navigate and extract data from the XML structure.

The parsing instructions are directly related to the tables described in the schema configuration files, in the sense that the tables that are being populated should be defined in the schemas.

Overview

The configuration defines how ClinVar XML elements are mapped to SQLite database tables. Each table configuration specifies:

Which XML elements to extract
How to extract attribute values
Parent-child relationships between tables
SQL statement generation

Configuration Structure

Each subsequent table follows this structure:

"TableName": {
    "returns_id": "table_id" | false,
    "parent_id": "parent_table_id" | "entity_id" | null,
    "attributes": { /* ... */ },
    "children": [ /* ... */ ],
    "sql": { /* ... */ }
}

Key Fields Explained

The return_id and parent_id keys can be used to define foreign key relationships between tables.

returns_id

Specifies if this table returns a row ID for child tables to reference.

String value (e.g., "variation_archive_id"): Returns the auto-generated primary key after insertion. The string value is documentary — it indicates which parent_id column in child tables will receive this ID, but the actual column mapping is determined by each child’s parent_id configuration.
false: Does not return an ID (leaf tables, no children need to reference it)

parent_id

Defines the foreign key relationship to the parent table.

null: No parent. Use for root or top-level tables.
Column name (e.g., "variation_archive_id"): Standard foreign key relationship. The parent’s row ID is stored in this column.
"entity_id": Polymorphic relationship. The table can belong to multiple parent types. When used, two columns are populated:
- entity_id: The parent’s row ID
- entity_type: The parent’s type (e.g., "Measure", "Trait", "RCVClassification")
The entity_type value is specified in the parent’s children configuration.

Example — standard relationship:

"ClinVarSet": {
    "returns_id": "clinvar_set_id",
    "children": [
        {
        "xpath": "./ReferenceClinVarAssertion",
        "table": "ReferenceClinVarAssertion"
        }
    ]
}

"ReferenceClinVarAssertion": {
    "parent_id": "clinvar_set_id"
}

Example — polymorphic relationship:

"Measure": {
    "returns_id": "measure_id",
    "children": [
        {
        "xpath": "./XRef", "table": "XRef", "entity_type": "Measure"
        },
        {
        "xpath": "./Citation", "table": "Citation", "entity_type": "Measure"
        }
    ]
}

"Trait": {
    "returns_id": "trait_id",
    "children": [
        {
        "xpath": "./XRef", "table": "XRef", "entity_type": "Trait"
        },
        {
        "xpath": "./Citation", "table": "Citation", "entity_type": "Trait"
        }
    ]
}

"XRef": {
    "parent_id": "entity_id"
}

Here, XRef rows can belong to either Measure or Trait. The entity_type column distinguishes which parent type each row references, whilst entity_id holds the parent’s row ID.

Extracting Attributes

The attributes section maps database columns to XML values. Here the key should be the column in the .db, and the parsing instructions are provided as the key value (i.e. the RHS).

Extraction Methods

1. XML Attribute

Extract from element’s attribute:

"gene_id": {"xml_attr": "GeneID", "cast": "int"}

From: <Gene GeneID="1234">

2. XML Text Content

Extract element’s text:

"name": {"xml_path": "./Name", "xml_text": true}

From: <Name>BRCA1</Name>

3. Nested Attribute

Extract attribute from nested element:

"review_status": {
    "xml_path": "./Classification/ReviewStatus",
    "xml_text": true
}

From: <Classification><ReviewStatus>reviewed</ReviewStatus></Classification>

4. XML Tag Name

Use the element’s tag as the value:

"classification_type": {"xml_tag": true}

From: <GermlineClassification> → stores “GermlineClassification”

Type Casting

Add "cast" to convert values:

"int": Integer
"float": Floating point
"bool": Boolean (true if value is “true”, case-insensitive)
"str": String (default, can be omitted)

Defining Children

The children array specifies which nested XML elements to parse recursively. When a parent element is parsed, the parser iterates through each child specification, finds matching XML elements using the xpath, and recursively parses them into the specified table.

How Child Parsing Works

The parent element is inserted into the database
If returns_id is set, the parent’s row ID is captured
For each child specification, the parser finds all matching XML elements
Each matched element is recursively parsed, receiving the parent’s row ID as its foreign key

Basic Child

For straightforward parent-child relationships where the child table has a single parent type:

"Measure": {
    "returns_id": "measure_id",
    "children": [
        {"xpath": "./SequenceLocation", "table": "SequenceLocation"}
    ]
}

"SequenceLocation": {
    "parent_id": "measure_id"
}

Here, each <SequenceLocation> element found within <Measure> is parsed and inserted with measure_id set to the parent’s row ID.

Polymorphic Child (entity_type)

For tables that can belong to multiple parent types (where parent_id: "entity_id"), the entity_type field identifies which parent type the child belongs to:

"Measure": {
    "returns_id": "measure_id",
    "children": [
        {"xpath": "./XRef", "table": "XRef", "entity_type": "Measure"},
        {"xpath": "./Comment", "table": "Comment", "entity_type": "Measure"}
    ]
}

"Trait": {
    "returns_id": "trait_id",
    "children": [
        {"xpath": "./XRef", "table": "XRef", "entity_type": "Trait"},
        {"xpath": "./Comment", "table": "Comment", "entity_type": "Trait"}
    ]
}

"XRef": {
    "parent_id": "entity_id"
}

When an XRef is parsed as a child of Measure, two columns are populated:

entity_type = "Measure"
entity_id = <parent's row ID>

This allows a single XRef table to store cross-references from multiple parent types, with the entity_type column distinguishing which parent each row belongs to.

Multiple Child Types

A parent can define multiple child specifications, each targeting different XML paths or tables:

"ReferenceClinVarAssertion": {
    "returns_id": "reference_assertion_id",
    "children": [
        {"xpath": "./MeasureSet", "table": "MeasureSet"},
        {"xpath": "./TraitSet", "table": "TraitSet"},
        {"xpath": "./ObservedIn", "table": "ObservedIn"},
        {"xpath": "./Classifications/GermlineClassification", "table": "RCVClassification"},
        {"xpath": "./Classifications/SomaticClinicalImpact", "table": "RCVClassification"}
    ]
}

Note: multiple xpaths can map to the same table (e.g., both GermlineClassification and SomaticClinicalImpact are stored in RCVClassification).

SQL Configuration

"sql": {
    "statement": "INSERT INTO" | "INSERT OR IGNORE INTO",
    "table_name": "ActualTableName",
    "columns": ["col1", "col2", "col3"]
}

statement This can be any insert statement, common examples are:
- "INSERT INTO": Standard insert (fails on duplicates)
- "INSERT OR IGNORE INTO": Ignore duplicate key violations
table_name: The database table name
columns: Ordered list of columns (must match attributes + parent_id + entity_type if applicable)

Column Rules

The columns array defines which columns are included in the INSERT statement. Values are extracted by column name, so order is flexible. Each column must have a corresponding source:

The parent_id column (or entity_type and entity_id for polymorphic tables)
Keys defined in attributes

Polymorphic tables (parent_id: “entity_id”):

"columns": ["entity_type", "entity_id", "db", "xref_id", ...]

Regular tables:

"columns": ["clinvar_set_id", "clinvar_id", "accession", ...]

Root tables (parent_id: null):

"columns": ["type", "dated"]

Common Patterns

Pattern 1: Simple Leaf Table

"OtherName": {
    "returns_id": false,
    "parent_id": "simple_allele_id",
    "attributes": {
        "name": {"xml_text": true},
        "type": {"xml_attr": "Type"}
    },
    "children": [],
    "sql": {
        "statement": "INSERT OR IGNORE INTO",
        "table_name": "OtherName",
        "columns": ["simple_allele_id", "name", "type"]
    }
}

Pattern 2: Parent with Children

"Gene": {
    "returns_id": "gene_id",
    "parent_id": "simple_allele_id",
    "attributes": {
        "gene_id": {"xml_attr": "GeneID", "cast": "int"},
        "symbol": {"xml_attr": "Symbol"}
    },
    "children": [
        {"xpath": "./Location/SequenceLocation",
         "table": "SequenceLocation"}
    ],
    "sql": {
        "statement": "INSERT OR IGNORE INTO",
        "table_name": "Gene",
        "columns": ["simple_allele_id", "gene_id", "symbol"]
    }
}

Pattern 3: Polymorphic Table

"Comment": {
    "returns_id": false,
    "parent_id": "entity_id",
    "attributes": {
        "comment_type": {"xml_attr": "Type"},
        "value": {"xml_text": true}
    },
    "children": [],
    "sql": {
        "statement": "INSERT OR IGNORE INTO",
        "table_name": "Comment",
        "columns": ["entity_type", "entity_id", "comment_type", "value"]
    }
}

Meta Section

The _meta section is used to define a progress table in the .db so parsing can be stopped and re-started. As part of this it defines the root element which is an (indirect) parent to all subsequent XML tags. The RHS JSON tags can be changed if needed, the LHS tags should not be changed.

"_meta": {
    "progress_table": "VCVParseProgress",
    "root_element": "ClinVarVariationRelease",
    "record_element": "VariationArchive",
    "accession_attr": "Accession"
}

progress_table: Database table for tracking parsing progress
root_element: XML root tag (parsed once per file)
record_element: Repeating record tag (main parsing unit)
accession_attr: Attribute name for unique record identifier

Additionally one can include an optional accession_path tag to specify the Xpath to the record containing the accession code relative to the record element.

Common Modifications

Adding a New Attribute

Add to attributes:

"new_field": {"xml_attr": "NewAttribute"}

Add to sql.columns:

"columns": ["existing_col", "new_field"]

Adding a New Child Table

Define the child table configuration

Add to parent’s children:

{"xpath": "./NewElement", "table": "NewTable"}

Changing Extraction Method

To extract from nested element instead of attribute:

// Before
"field": {"xml_attr": "Value"}

// After
"field": {"xml_path": "./ValueElement", "xml_text": true}

XPath Tips

Relative paths: Start with ./ for immediate children
Descendant search: Use .// to search all descendants
Multiple levels: ./Parent/Child/Grandchild

Example:

// Immediate child
"xpath": "./Gene"

// Any descendant named Gene
"xpath": ".//Gene"

// Specific nested path
"xpath": "./ClassifiedRecord/SimpleAllele"

Validation Checklist

When modifying the configuration:

[ ] returns_id matches what children reference as parent_id
[ ] sql.columns includes all attributes keys
[ ] sql.columns includes parent_id column (if not null)
[ ] Polymorphic tables have entity_type and entity_id in columns
[ ] XPath expressions match actual XML structure
[ ] Type casts are appropriate for data types
[ ] Child entity_type values match parent table names