Schema generator

ClinVar SQLite Schema Generator.

This module provides schema generation for ClinVar databases:
  • RCV (Reference ClinVar Assertion) - condition-centric format

  • VCV (Variant Call Variation) - variant-centric format

The SQL schemas are based on the following XSD files:
  • RCV: ClinVar_RCV_weekly.xsd v2.2 (August 6, 2025)

  • VCV: ClinVar_VCV.xsd v2.5 (August 6, 2025)

Notes

This module creates only the database schema structure (empty tables). To populate databases with actual ClinVar data, use clinvar_parser.py.

class clinvar_build.schema_generator.ClinVarSchemaGenerator[source]

Builder for ClinVar SQLite database schemas.

This class provides methods to create database schemas for ClinVar RCV (condition-centric) and VCV (variant-centric) databases. It handles connecting to the SQLite database, executing table and index creation, and safely closing the connection.

conn

Active SQLite connection, or None if not connected.

Type:

sqlite3.Connection or None

cursor

Cursor for executing SQL statements, or None if not connected.

Type:

sqlite3.Cursor or None

create_schema(db_path, db_config, db_indices=None, name=None)[source]

Create a SQLite schema for a given ClinVar database.

__init__()[source]

Initialise the schema builder.

__repr__() str[source]

Return unambiguous string representation.

Returns:

String representation suitable for debugging

Return type:

str

__str__() str[source]

Return human-readable string representation.

Returns:

Human-readable description of the builder

Return type:

str

__weakref__

list of weak references to the object

create_schema(db_path: str | Path, db_config: dict[str, str], db_indices: dict[str, str] | None, name: str | None = None) None[source]

Create a SQLite schema for ClinVar (RCV, VCV).

Parameters:
  • db_path (str or Path) – Path to SQLite database file.

  • db_config (dict [str, str]) – Dictionary with SQL CREATE TABLE statements.

  • db_indices (dict [str, str] or None, default None) – Dictionary with SQL CREATE INDEX statements.

  • name (str or None, default None) – Optional name for logging. Defaults to the filename stem of db_path.

Return type:

None
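As a rough illustration of what the parameters above imply, the sketch below runs a dictionary of CREATE TABLE statements and then a dictionary of CREATE INDEX statements with the standard sqlite3 module. It is not the ClinVarSchemaGenerator implementation; the function name create_schema_sketch and the variant table and index are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def create_schema_sketch(db_path, db_config, db_indices=None):
    """Execute CREATE TABLE, then CREATE INDEX, statements against SQLite."""
    conn = sqlite3.connect(str(db_path))
    try:
        cursor = conn.cursor()
        for statement in db_config.values():
            cursor.execute(statement)
        # Indices are optional, so treat None as "no indices".
        for statement in (db_indices or {}).values():
            cursor.execute(statement)
        conn.commit()
    finally:
        # Close the connection even if a statement fails.
        conn.close()


# Hypothetical table and index definitions, for illustration only.
tables = {
    "variant": "CREATE TABLE IF NOT EXISTS variant ("
               "id INTEGER PRIMARY KEY, accession TEXT NOT NULL)",
}
indices = {
    "idx_variant_accession": "CREATE INDEX IF NOT EXISTS idx_variant_accession "
                             "ON variant (accession)",
}

with tempfile.TemporaryDirectory() as tmp:
    db_file = Path(tmp) / "example.db"
    create_schema_sketch(db_file, tables, indices)
    # Re-open the file to list what was created.
    check = sqlite3.connect(str(db_file))
    created = sorted(row[0] for row in check.execute(
        "SELECT name FROM sqlite_master WHERE type IN ('table', 'index')"))
    check.close()
```

Keying the statements by name keeps the table definitions separate from the code that executes them, which is presumably why db_config and db_indices are passed as dictionaries.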

clinvar_build.schema_generator.main()[source]

Command-line interface for schema generation.

Examples

# Create RCV schema
python clinvar_schema.py /data --rcv

# Create RCV and VCV schemas
python clinvar_schema.py /data --rcv --vcv

clinvar_build.schema_generator.parse_arguments()[source]

Parse command-line arguments for schema generation.

Returns:

Parsed command-line arguments containing directory, rcv, and vcv attributes.

Return type:

argparse.Namespace

Parser

ClinVar XML to SQLite database parser

This module provides a comprehensive parser for converting ClinVar XML files into SQLite databases. It implements a configuration-driven architecture that supports both RCV (Reference ClinVar) and VCV (Variation ClinVar) XML formats through JSON-based table and column specifications.

The parser uses iterative parsing and batch commits to minimise memory usage. Progress is tracked in a SQL table, so an interrupted run can resume after the last committed record.
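The combination of streaming, batch commits, and a progress table can be sketched with the standard library alone. This is a simplified illustration, not the ClinVarParser code: the Record element, the record and progress tables, and the load_records function are all assumptions made for the example.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# A tiny stand-in for a ClinVar XML file (hypothetical structure).
XML = b"<Release>" + b"".join(
    b"<Record id='%d'><Name>v%d</Name></Record>" % (i, i) for i in range(5)
) + b"</Release>"


def load_records(source, conn, batch_size=2):
    """Stream <Record> elements into SQLite, committing every batch_size records."""
    conn.execute("CREATE TABLE IF NOT EXISTS record (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS progress (key TEXT PRIMARY KEY, last_index INTEGER)")
    row = conn.execute("SELECT last_index FROM progress WHERE key = 'record'").fetchone()
    done = row[0] if row else 0  # number of records already committed
    seen = 0
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Record":
            continue
        seen += 1
        if seen > done:  # skip records committed by an earlier run
            conn.execute("INSERT INTO record VALUES (?, ?)",
                         (int(elem.get("id")), elem.findtext("Name")))
            if seen % batch_size == 0:
                # The checkpoint is written in the same transaction as the data.
                conn.execute("INSERT OR REPLACE INTO progress VALUES ('record', ?)", (seen,))
                conn.commit()
        elem.clear()  # release the children of the processed record
    conn.execute("INSERT OR REPLACE INTO progress VALUES ('record', ?)", (seen,))
    conn.commit()
    return seen - done


conn = sqlite3.connect(":memory:")
first = load_records(io.BytesIO(XML), conn)   # loads all five records
second = load_records(io.BytesIO(XML), conn)  # resumes and finds nothing new
count = conn.execute("SELECT COUNT(*) FROM record").fetchone()[0]
```

Writing the checkpoint inside the same transaction as the inserted rows means a crash cannot leave the progress table ahead of the data.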

class clinvar_build.parser.ClinVarParser(config: dict[str, Any] | None = None)[source]

Parser for ClinVar XML files.

Parameters:

config (dict[str, Any]) – A dictionary with instructions to parse an XML file to a SQLite database.

conn

Active SQLite connection, or None if not connected.

Type:

sqlite3.Connection or None

cursor

Cursor for executing SQL statements, or None if not connected.

Type:

sqlite3.Cursor or None

stats

Statistics on parsed records

Type:

dict [str, int]

Class Attributes

_SQL_COLS_PATTERN

Compiled regex to extract column names from SQL INSERT statements

Type:

re.Pattern

parse_file(xml_path: str | Path, db_path: str | Path, batch_size: int = 10, validate_foreign_keys: bool = True, enforce_foreign_keys: bool = False, xsd_path: str | Path | None = None, xsd_strict: bool = False, resume: bool = True, count_duplicates: bool = False) None[source]

Parse ClinVar XML file into SQLite database.

Parameters:
  • xml_path (str or Path) – Path to ClinVar XML file

  • db_path (str or Path) – Path to SQLite database file

  • batch_size (int, default 10) – Number of top-level records (including children) to commit at once.

  • validate_foreign_keys (bool, default True) – Whether to validate foreign key integrity after parsing

  • enforce_foreign_keys (bool, default False) – Whether to enforce foreign keys during parsing (slower but catches errors immediately). If False, foreign keys are only validated after parsing completes.

  • xsd_path (str, Path, or None, default None) – Optional path to XSD schema for XML validation

  • xsd_strict (bool, default False) – If False, allows XML elements not defined in XSD

  • resume (bool, default True) – If True, attempt to resume from last checkpoint. If no checkpoint exists, starts fresh. If False, always starts fresh (existing data may cause constraint violations)

  • count_duplicates (bool, default False) – After building the database print potential duplicated rows per table
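The difference between enforcing foreign keys during parsing and validating them afterwards maps onto two SQLite pragmas. The sketch below uses hypothetical parent and child tables and only shows the underlying SQLite behaviour; how parse_file wires these options internally is not stated in this reference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent (id)
    );
""")

# Deferred checking: with enforcement off, an orphan row is accepted...
conn.execute("PRAGMA foreign_keys = OFF")
conn.execute("INSERT INTO child VALUES (1, 99)")
conn.commit()
# ...and is only reported by an explicit check afterwards.
violations = conn.execute("PRAGMA foreign_key_check").fetchall()

# Immediate checking: with enforcement on, the same kind of insert fails at once.
conn.execute("PRAGMA foreign_keys = ON")
try:
    conn.execute("INSERT INTO child VALUES (2, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
conn.rollback()
```

Note that PRAGMA foreign_keys has no effect inside an open transaction, which is why the sketch commits before switching enforcement on.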

clinvar_build.parser.main()[source]

Main entry point for ClinVar XML parser.

Parses command-line arguments, loads configuration, initializes parser, and processes ClinVar XML file into SQLite database.

clinvar_build.parser.parse_arguments()[source]

Parse command-line arguments for database parsing.

Returns:

Parsed command-line arguments containing:

  • db : Database file path

  • xml : XML file path

  • xsd : XSD schema file path (optional)

  • rcv : Whether to create RCV schema

  • vcv : Whether to create VCV schema

  • batch_size : Number of records to commit at once

  • enforce_fk : Whether to enforce foreign keys during parsing

  • no_validation : Whether to skip post-build validation

  • count_duplicates : Whether to count duplicate rows

  • verbose : Verbosity level (0-3)

Return type:

argparse.Namespace

General utils

General utility functions for the clinvar-build module

clinvar_build.utils.general.assign_empty_default(arguments: list[Any], empty_object: Callable[[], Any]) list[Any][source]

Takes a list of arguments, checks whether each is None, and if so assigns it the result of calling ‘empty_object’.

Parameters:
  • arguments (list [any]) – A list of arguments which may be set to NoneType.

  • empty_object (Callable) – A function that returns a mutable object. Examples include a list or a dict.

Returns:

new_arguments – List with NoneType replaced by empty mutable object.

Return type:

list

Examples

>>> assign_empty_default(['hi', None, 'hello'], empty_object=list)
['hi', [], 'hello']

Notes

This function helps deal with the pitfall of assigning an empty mutable object as a default function argument, which would persist through multiple function calls, leading to unexpected/undesired behaviours.
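The pitfall described in the note can be reproduced in a few lines. The sketch below contrasts a function with a shared mutable default against one that uses a None default; assign_empty_default_sketch is a minimal stand-in written for this example and may differ from the packaged function.

```python
def append_bad(item, bucket=[]):
    # The default list is created once, at definition time, and then shared.
    bucket.append(item)
    return bucket


def assign_empty_default_sketch(arguments, empty_object):
    """Replace each None in arguments with a fresh object from empty_object()."""
    return [empty_object() if arg is None else arg for arg in arguments]


def append_good(item, bucket=None):
    # A fresh list is created on every call where bucket is None.
    (bucket,) = assign_empty_default_sketch([bucket], empty_object=list)
    bucket.append(item)
    return bucket


append_bad("a")
shared = append_bad("b")   # the first call's item is still present
append_good("a")
fresh = append_good("b")   # only the item from this call
```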

Configuration tools

Configuration parsing and XML validation utilities for ClinVar Build.

This module provides utilities for parsing configuration files and managing logging output for long-running operations. It includes classes for handling block-based configuration files and property management with controlled access.

class clinvar_build.utils.config_tools.BlockConfigParser(path: str)[source]

Parses configuration files with headers and concatenated content, essentially parsing a block of text to a single string.

This parser identifies sections marked by headers wrapped in square brackets and appends all subsequent lines (until the next header) into a single string for each section.

Parameters:

path (str or Path) – Path to the configuration file to be parsed.

parsed_data

The parsed data.

Type:

dict [str, str]

Examples

Given a configuration file:

[header1]
This is line one
This is line two

[header2]
Another section
With multiple lines

The parser will create:

{
    'header1': 'This is line one\nThis is line two',
    'header2': 'Another section\nWith multiple lines'
}
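The header-and-block behaviour shown in this example can be approximated as follows. This is a sketch under stated assumptions, not the BlockConfigParser source: parse_blocks is a hypothetical function, it works on a string rather than a file path, and it assumes blank lines are skipped.

```python
import re

# A header is a line consisting of a name wrapped in square brackets.
HEADER = re.compile(r"^\[(?P<name>[^\[\]]+)\]\s*$")


def parse_blocks(text, sep="\n"):
    """Map each [header] to the lines that follow it, joined by sep."""
    blocks = {}
    current = None
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            current = match.group("name")
            blocks[current] = []
        elif current is not None and line.strip():
            # Assumption: blank lines are dropped, other lines are stripped.
            blocks[current].append(line.strip())
    return {name: sep.join(lines) for name, lines in blocks.items()}


example = """\
[header1]
This is line one
This is line two

[header2]
Another section
With multiple lines
"""
parsed = parse_blocks(example)
```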

__call__(section: str | None = None, sep='\n') Self[source]

Parse the configuration file into sections with concatenated content.

Parameters:
  • section (str or None, default None) – The specific section to parse from the configuration file. If provided, only content within the matching section delimiters ([[section]]) will be processed. If None (default), all sections are parsed.

  • sep (str, default '\n') – The separator to indicate a new line.

Returns:

The parser instance with populated data.

Return type:

Self

__eq__(other)[source]

Determine how instances are compared.

__init__(path: str)[source]

Initialize the BlockConfigParser instance.

__repr__()[source]

Developer-friendly representation of the parsed data.

__str__()[source]

String representation of the parsed data.

class clinvar_build.utils.config_tools.ManagedProperty(name: str, types: tuple[type] | type | None = None)[source]

A generic property factory defining setters and getters, with optional type validation.

Parameters:
  • name (str) – The name of the setters and getters

  • types (type, tuple of type, or None, default None) – Either a single type, or a tuple of types to test against.

enable_setter()[source]

Enables the setter for the property, allowing attribute assignment.

disable_setter()[source]

Disables the setter for the property, making the property read-only.

set_with_setter(instance, value)[source]

Enables the setter, sets the property value, and then disables the setter, ensuring controlled updates.

Returns:

A property object with getter and setter.

Return type:

property

__get__(instance, owner)[source]

Getter for the property.

__init__(name: str, types: tuple[type] | type | None = None)[source]

Initialize the ManagedProperty.

__set__(instance, value)[source]

Setter for the property.

disable_setter()[source]

Disable the setter for the property.

enable_setter()[source]

Enable the setter for the property.

set_with_setter(instance, value)[source]

Enable the setter, set the property value, and then disable the setter.

Parameters:
  • instance (object) – The instance on which the property is being set.

  • value (any) – The value to assign to the property.
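A descriptor with a switchable setter, as described above, might look roughly like the sketch below. ManagedPropertySketch and the Example class are hypothetical; in particular, storing the value under a leading-underscore attribute and raising AttributeError when the setter is disabled are assumptions, not documented behaviour of ManagedProperty.

```python
class ManagedPropertySketch:
    """Data descriptor whose setter can be switched on and off."""

    def __init__(self, name, types=None):
        self.name = "_" + name  # private attribute holding the value (assumed)
        self.types = types
        self._setter_enabled = False

    def __get__(self, instance, owner):
        if instance is None:
            return self  # accessed on the class: return the descriptor itself
        return getattr(instance, self.name, None)

    def __set__(self, instance, value):
        if not self._setter_enabled:
            raise AttributeError(f"{self.name[1:]} is read-only")
        if self.types is not None and not isinstance(value, self.types):
            raise TypeError(f"{self.name[1:]} expects {self.types}")
        setattr(instance, self.name, value)

    def enable_setter(self):
        self._setter_enabled = True

    def disable_setter(self):
        self._setter_enabled = False

    def set_with_setter(self, instance, value):
        # Open the setter only for the duration of this one assignment.
        self.enable_setter()
        try:
            self.__set__(instance, value)
        finally:
            self.disable_setter()


class Example:
    count = ManagedPropertySketch("count", types=int)

    def __init__(self):
        Example.count.set_with_setter(self, 0)


obj = Example()
try:
    obj.count = 5  # direct assignment while the setter is disabled
    blocked = False
except AttributeError:
    blocked = True
```

In this sketch the enabled flag lives on the descriptor, so it is shared by all instances of the owning class; a per-instance flag would need extra bookkeeping.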

class clinvar_build.utils.config_tools.ProgressHandler(stream=None)[source]

Custom handler that updates progress in place.

Uses ANSI escape codes to overwrite previous output instead of printing new lines. Useful for progress updates during long-running operations.

Parameters:

stream (file-like object, optional) – Output stream. Defaults to sys.stdout.

_last_line_count

Number of lines in the previous message, used to calculate how far to move the cursor up.

Type:

int

Examples

>>> progress_logger = logging.getLogger('progress')
>>> handler = ProgressHandler()
>>> handler.setFormatter(logging.Formatter('%(asctime)s - %(message)s'))
>>> progress_logger.addHandler(handler)
>>> progress_logger.info("Processing: 100 records")
>>> progress_logger.info("Processing: 200 records")  # Overwrites previous

emit(record: LogRecord) None[source]

Emit a log record with in-place updating.

Parameters:

record (logging.LogRecord) – The log record to emit

reset() None[source]

Reset line count.

Call this when switching from in-place updates to normal logging to prevent cursor position issues.
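For readers unfamiliar with in-place terminal updates, the following sketch shows one way a logging handler can overwrite its previous message with ANSI escape codes. InPlaceHandler is a hypothetical stand-in; the specific escape sequences used by ProgressHandler are not given in this reference and are assumed here.

```python
import io
import logging
import sys


class InPlaceHandler(logging.StreamHandler):
    """Overwrite the previous message instead of appending a new line."""

    def __init__(self, stream=None):
        super().__init__(stream or sys.stdout)
        self._last_line_count = 0

    def emit(self, record):
        message = self.format(record)
        if self._last_line_count:
            # ESC[<n>F moves the cursor up n lines; ESC[J clears to end of screen.
            self.stream.write(f"\x1b[{self._last_line_count}F\x1b[J")
        self.stream.write(message + "\n")
        self.flush()
        # Remember how many lines were written so the next call can move back up.
        self._last_line_count = message.count("\n") + 1

    def reset(self):
        self._last_line_count = 0


# Capture the output in a buffer so the raw escape codes can be inspected.
buffer = io.StringIO()
logger = logging.getLogger("progress_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(InPlaceHandler(buffer))
logger.info("Processing: 100 records")
logger.info("Processing: 200 records")
output = buffer.getvalue()
```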

clinvar_build.utils.config_tools.check_environ(environ_variable: str = 'CLINVAR_BUILD_CONFIG', fall_back: str | Path | None = PosixPath('/usr/local/share/clinvar_build/clinvar_config')) str[source]

Retrieve an environment variable pointing to a directory path, with optional fallback path.

Attempts to retrieve the specified environment variable. If the variable is not set, the function will attempt to use the fallback path if provided. This is useful for configuration management where environment variables may not always be explicitly set.

Parameters:
  • environ_variable (str) – The name of the environment variable to retrieve.

  • fall_back (str, Path or None) – A fallback path to return if the environment variable is not set. If None, an error will be raised when the environment variable is missing.

Returns:

A directory path.

Return type:

str

Raises:
  • KeyError – Raised when the environment variable is not set and fall_back is None.

  • TypeError – Raised when environ_variable is not of type str or fall_back is not of type str, Path, or None.

Notes

The function does not check whether the path exists or whether permissions allow read or write access.

Warns:

UserWarning – Issued when the environment variable is not set and the fallback path is used instead.

Examples

>>> check_environ("MY_VAR")
'/path/to/default/config'
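The lookup order described above (environment variable first, then fallback with a warning, otherwise an error) can be sketched as follows. check_environ_sketch is a hypothetical reimplementation based only on the documented behaviour, and SKETCH_UNSET_VAR is a made-up variable name.

```python
import os
import warnings
from pathlib import Path


def check_environ_sketch(environ_variable="CLINVAR_BUILD_CONFIG", fall_back=None):
    """Return the variable's value, or the fallback path with a warning."""
    if not isinstance(environ_variable, str):
        raise TypeError("environ_variable must be a str")
    if fall_back is not None and not isinstance(fall_back, (str, Path)):
        raise TypeError("fall_back must be a str, Path, or None")
    value = os.environ.get(environ_variable)
    if value is not None:
        return value
    if fall_back is None:
        raise KeyError(f"{environ_variable} is not set and no fallback was given")
    warnings.warn(f"{environ_variable} is not set; using {fall_back}", UserWarning)
    return str(fall_back)


# Make sure the made-up variable is unset, then capture the warning.
os.environ.pop("SKETCH_UNSET_VAR", None)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    path = check_environ_sketch("SKETCH_UNSET_VAR", fall_back="/tmp/config")
```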

Parser tools

XML parsing and SQLite database utilities for ClinVar Build.

This module provides classes and functions for parsing large ClinVar XML files and loading them into SQLite databases. It includes utilities for database validation, progress tracking, error formatting, and XML inspection.

class clinvar_build.utils.parser_tools.SQLiteParser(config: dict[str, Any] | None = None)[source]

A general SQLite parser.

Parameters:

config (dict[str, Any]) – A dictionary with instructions to parse an XML file to a SQLite database.

conn

Active SQLite connection, or None if not connected.

Type:

sqlite3.Connection or None

cursor

Cursor for executing SQL statements, or None if not connected.

Type:

sqlite3.Cursor or None

stats

Statistics on parsed records.

Type:

dict [str, int]

config

A configuration dictionary with parsing instructions.

Type:

dict [str, any]

__init__(config: dict[str, Any] | None = None)[source]

Initialize parser.

__repr__() str[source]

Return unambiguous string representation.

Returns:

String representation suitable for debugging

Return type:

str

__str__() str[source]

Return human-readable string representation.

Returns:

Human-readable description of the parser

Return type:

str

count_duplicates()[source]

Count duplicate rows in all tables based on non-id columns.

For each table in the configuration, identifies duplicate rows where all columns (except the primary key ‘id’) are identical, treating NULL values as equal.

Returns:

Dictionary mapping table names to number of duplicates found

Return type:

dict [str, int]

Notes

Uses SQLite’s rowid to identify duplicates, counting rows that would be removed (keeping the row with lowest rowid). NULL values are treated as equal via GROUP BY.
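The rowid and GROUP BY approach from the note can be illustrated with a single query. The gene table and the count_duplicates_sketch function are hypothetical, and the exact SQL used by count_duplicates may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT, source TEXT);
    INSERT INTO gene (symbol, source) VALUES
        ('BRCA1', 'HGNC'), ('BRCA1', 'HGNC'), ('BRCA1', 'HGNC'),
        ('TP53', NULL), ('TP53', NULL),
        ('EGFR', 'HGNC');
""")


def count_duplicates_sketch(conn, table, columns):
    """Count rows that are not the lowest-rowid member of their group."""
    # Table and column names are interpolated, so they must come from a
    # trusted source such as the parser configuration.
    col_list = ", ".join(columns)
    query = (
        f"SELECT COUNT(*) FROM {table} WHERE rowid NOT IN "
        f"(SELECT MIN(rowid) FROM {table} GROUP BY {col_list})"
    )
    return conn.execute(query).fetchone()[0]


# Two extra BRCA1 rows plus one extra TP53 row (NULLs group together).
duplicates = count_duplicates_sketch(conn, "gene", ["symbol", "source"])
```

GROUP BY places NULL values in the same group, whereas an equality comparison such as a self-join on `a.source = b.source` would not match them.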

validate_database() dict[str, Any][source]

Run comprehensive database validation checks.

Performs multiple validation checks including foreign key integrity, table row counts, and basic statistics.

Returns:

Dictionary containing validation results and statistics

Return type:

dict [str, Any]

Raises:

ValueError – If foreign key violations are found

Examples

>>> with parser._connection(db_path):
...     results = parser.validate_database()
>>> print(results)
{'foreign_keys': 'valid', 'total_records': 12345, ...}

class clinvar_build.utils.parser_tools.ViewXML(xml_path: str | Path, tag_name: str, index: int = 0)[source]

Class to view and inspect XML records.

This class loads a single XML element and provides utilities for inspecting and navigating its structure, particularly useful for large ClinVar XML files.

Parameters:
  • xml_path (str or Path) – Path to ClinVar XML file (supports .gz compression)

  • tag_name (str) – XML tag name to search for (e.g., ‘VariationArchive’)

  • index (int, default 0) – Which occurrence of the tag to load (0-indexed)

xml_path

Path to the source XML file

Type:

Path

tag_name

The tag name that was searched for

Type:

str

index

The index of the loaded element

Type:

int

element

The loaded XML element, or None if not found

Type:

ET.Element or None

show_tree(max_depth=4, indent=0)[source]

Display element tree structure up to specified depth

show_children(indent=0)[source]

Display immediate children with summary information

find_all_paths()[source]

Find all unique XPath-like paths in the element tree

count_all_tags()[source]

Count all descendant tags by frequency

Examples

>>> viewer = ViewXML('clinvar.xml.gz', 'VariationArchive', index=5)
>>> viewer.show_tree()
>>> paths = viewer.find_all_paths()

__init__(xml_path: str | Path, tag_name: str, index: int = 0)[source]

Initialise ViewXML with a single XML element.

__repr__() str[source]

Return unambiguous string representation.

Returns:

String that could recreate the object

Return type:

str

__str__() str[source]

Return human-readable string representation.

Returns:

Summary of the loaded element

Return type:

str

count_all_tags() dict[str, int][source]

Count all descendant tags in loaded element tree.

Returns:

Dictionary mapping tag names to their occurrence counts, ordered by frequency (most common first)

Return type:

dict of {str: int}

Raises:

ValueError – If no element is loaded
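A frequency count of descendant tags can be sketched with collections.Counter. This is an illustration on a small hand-written element, not the ViewXML implementation; whether the real method excludes the root element is an assumption based on the word "descendant".

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A small hand-written element standing in for a loaded ClinVar record.
element = ET.fromstring(
    "<VariationArchive><Gene/><Gene/><Trait><Name/></Trait></VariationArchive>"
)


def count_descendant_tags(element):
    """Count tags below element, most common first."""
    # element.iter() also yields the element itself, so it is filtered out.
    counts = Counter(child.tag for child in element.iter() if child is not element)
    return dict(counts.most_common())


tag_counts = count_descendant_tags(element)
```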

find_all_paths() list[str][source]

Find all unique XPath-like paths in the loaded element tree.

Returns:

Sorted list of all unique paths found in the tree

Return type:

list of str

Raises:

ValueError – If no element is loaded

Examples

>>> viewer = ViewXML('clinvar.xml.gz', 'VariationArchive')
>>> paths = viewer.find_all_paths()
>>> print(paths[:5])
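To give a sense of what "XPath-like paths" could mean, the sketch below collects one slash-separated path per distinct position in a small element tree. The find_paths function and the path format are assumptions; the output format of find_all_paths is not specified in this reference.

```python
import xml.etree.ElementTree as ET


def find_paths(element, prefix=""):
    """Return the set of slash-separated tag paths in the element tree."""
    path = f"{prefix}/{element.tag}"
    paths = {path}
    for child in element:
        # Repeated siblings (such as the two Gene elements) collapse into one path.
        paths |= find_paths(child, path)
    return paths


root = ET.fromstring(
    "<VariationArchive><Gene/><Gene/><Trait><Name/></Trait></VariationArchive>"
)
paths = sorted(find_paths(root))
```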

show_children(indent: int = 0) None[source]

Display immediate children of loaded element with summary info.

Parameters:

indent (int, default 0) – Indentation level for output formatting

Raises:

ValueError – If no element is loaded

show_tree(max_depth=4, indent=0)[source]

Display element tree structure up to specified depth.

Parameters:
  • max_depth (int, default 4) – Maximum depth to display

  • indent (int, default 0) – Starting indentation level

Raises:

ValueError – If no element is loaded

clinvar_build.utils.parser_tools.configure_logging(verbosity: int) None[source]

Configure logging based on verbosity level.

This function configures the root logger, which affects all loggers in the application through inheritance.

Parameters:

verbosity (int) – Number of -v flags. 0 = WARNING, 1 = INFO, 2 = DEBUG, 3 = TRACE
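The verbosity mapping above includes a TRACE level, which the standard logging module does not define. One common way to add it is sketched below; the numeric value 5, the clamping of out-of-range verbosity, and the helper name level_for are assumptions rather than details taken from configure_logging.

```python
import logging

TRACE = 5  # assumed numeric value, below logging.DEBUG (10)
logging.addLevelName(TRACE, "TRACE")

LEVELS = {0: logging.WARNING, 1: logging.INFO, 2: logging.DEBUG, 3: TRACE}


def level_for(verbosity):
    """Clamp verbosity to the known range and return the logging level."""
    return LEVELS[max(0, min(verbosity, 3))]


def trace(self, message, *args, **kwargs):
    """Log message at TRACE level, mirroring Logger.debug and friends."""
    if self.isEnabledFor(TRACE):
        self._log(TRACE, message, args, **kwargs)


# Attach the method so every logger gains a .trace() call.
logging.Logger.trace = trace

logger = logging.getLogger("trace_sketch")
logger.setLevel(level_for(3))
enabled = logger.isEnabledFor(TRACE)
```

Applying the level to the root logger instead, with `logging.getLogger().setLevel(level_for(verbosity))`, would affect all loggers that inherit from it.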

clinvar_build.utils.parser_tools.open_xml_file(xml_path: str | Path, xsd_path: str | Path | None = None, strict: bool = True, verbose: bool = True)[source]

Helper function to open an XML file, handling compression based on extension.

Automatically closes the file handle when exiting the context, even if an exception occurs.

Parameters:
  • xml_path (str or Path) – The path to the file.

  • xsd_path (str, Path, or None, default None) – Optional path to XSD schema file for validation. If provided, validates XML before yielding the file handle.

  • strict (bool, default True) – If False, ignores elements in the XML that are not in the XSD. Only used when xsd_path is provided.

  • verbose (bool, default True) – If True, warns about validation issues when strict=False. Only used when xsd_path is provided.

Yields:

file-like object – An open file handle for the compressed or uncompressed XML.

Raises:
  • FileNotFoundError – If xml_path or xsd_path does not exist

  • IOError – If file cannot be opened

  • XMLValidationError – If XSD validation fails (when xsd_path is provided)
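The extension-based handling of compression can be sketched as a small context manager. open_xml_sketch is hypothetical and omits the XSD validation step; it assumes that only a .gz suffix indicates compression.

```python
import gzip
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def open_xml_sketch(xml_path):
    """Yield a binary file handle, transparently decompressing .gz files."""
    path = Path(xml_path)
    if not path.exists():
        raise FileNotFoundError(path)
    opener = gzip.open if path.suffix == ".gz" else open
    handle = opener(path, "rb")
    try:
        yield handle
    finally:
        # Runs on normal exit and when the with-block raises.
        handle.close()


with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "sample.xml"
    packed = Path(tmp) / "sample.xml.gz"
    plain.write_bytes(b"<Release/>")
    with gzip.open(packed, "wb") as out:
        out.write(b"<Release/>")
    # Both variants yield the same bytes to the caller.
    with open_xml_sketch(plain) as fh:
        plain_content = fh.read()
    with open_xml_sketch(packed) as fh:
        packed_content = fh.read()
```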

clinvar_build.utils.parser_tools.trace(self, message, *args, **kwargs)[source]

Custom trace method for detailed logging.

clinvar_build.utils.parser_tools.validate_xml(xml_path: str | Path, xsd_path: str | Path, strict: bool = True, verbose: bool = True) _ElementTree[source]

Validates an XML file against an XSD schema.

Parameters:
  • xml_path (str or Path) – Path to the XML file.

  • xsd_path (str or Path) – Path to the XSD file.

  • strict (bool, default True) – If False, ignores elements in the XML that are not in the XSD.

Returns:

The parsed XML document.

Return type:

etree._ElementTree

Raises:

XMLValidationError – Raised if the XSD and XML are incompatible.