Schema generator
ClinVar SQLite Schema Generator.
This module provides schema generation for ClinVar databases:
- RCV (Reference ClinVar Assertion) - condition-centric format
- VCV (Variant Call Variation) - variant-centric format
The SQL schemas are based on the following XSD files:
- RCV: ClinVar_RCV_weekly.xsd v2.2 (August 6, 2025)
- VCV: ClinVar_VCV.xsd v2.5 (August 6, 2025)
Notes
This module creates only the database schema structure (empty tables). To populate databases with actual ClinVar data, use clinvar_parser.py.
- class clinvar_build.schema_generator.ClinVarSchemaGenerator[source]
Builder for ClinVar SQLite database schemas.
This class provides methods to create database schemas for ClinVar RCV (condition-centric) and VCV (variant-centric) databases. It handles connecting to the SQLite database, executing table and index creation, and safely closing the connection.
- conn
Active SQLite connection, or None if not connected.
- Type:
sqlite3.Connection or None
- cursor
Cursor for executing SQL statements, or None if not connected.
- Type:
sqlite3.Cursor or None
- create_schema(db_path, db_config, db_indices=None, name=None)[source]
Create a SQLite schema for a given ClinVar database.
- __repr__() str[source]
Return unambiguous string representation.
- Returns:
String representation suitable for debugging
- Return type:
str
- __str__() str[source]
Return human-readable string representation.
- Returns:
Human-readable description of the builder
- Return type:
str
- create_schema(db_path: str | Path, db_config: dict[str, str], db_indices: dict[str, str] | None, name: str | None = None) None[source]
Create a SQLite schema for ClinVar (RCV, VCV).
- Parameters:
db_path (str or Path) – Path to SQLite database file.
db_config (dict [str, str]) – Dictionary with SQL CREATE TABLE statements.
db_indices (dict [str, str] or None, default None) – Dictionary with SQL CREATE INDEX statements.
name (str or None, default None) – Optional name for logging. Defaults to the filename stem of db_path.
- Return type:
None
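As a rough illustration of how a dictionary of CREATE statements could be applied, the sketch below uses only the standard `sqlite3` module. The function `create_schema_sketch` and the `variant` table are hypothetical and are not the module's actual implementation or the real ClinVar schema.

```python
import sqlite3


def create_schema_sketch(db_path, db_config, db_indices=None):
    """Hypothetical stand-in for create_schema: run the CREATE statements."""
    db_indices = db_indices or {}
    conn = sqlite3.connect(str(db_path))
    try:
        cursor = conn.cursor()
        # Create the tables first, then the indices that refer to them.
        for statement in db_config.values():
            cursor.execute(statement)
        for statement in db_indices.values():
            cursor.execute(statement)
        conn.commit()
    finally:
        # Close the connection even if a statement fails.
        conn.close()


# Hypothetical definitions for illustration only.
tables = {
    "variant": "CREATE TABLE IF NOT EXISTS variant "
               "(id INTEGER PRIMARY KEY, name TEXT)",
}
indices = {
    "idx_variant_name": "CREATE INDEX IF NOT EXISTS idx_variant_name "
                        "ON variant (name)",
}
```

Calling `create_schema_sketch("example.db", tables, indices)` would leave an empty database containing one table and one index.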
- clinvar_build.schema_generator.main()[source]
Command-line interface for schema generation.
Examples
# Create RCV schema
python clinvar_schema.py /data --rcv
# Create RCV and VCV schemas
python clinvar_schema.py /data --rcv --vcv
Parser
ClinVar XML to SQLite database parser
This module provides a comprehensive parser for converting ClinVar XML files into SQLite databases. It implements a configuration-driven architecture that supports both RCV (Reference ClinVar) and VCV (Variation ClinVar) XML formats through JSON-based table and column specifications.
The parser uses iterative parsing and batch commits to minimise memory usage. Progress is tracked in a SQL table, so an interrupted run can restart after the last committed record.
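The combination of iterative parsing, batch commits and a progress table can be sketched with the standard library alone. Everything here is an assumption made for illustration: the `Record` tag, the `record` and `progress` tables, and the function name `parse_in_batches` are hypothetical, and the real parser is driven by its JSON configuration instead.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def parse_in_batches(xml_source, conn, tag="Record", batch_size=2):
    """Stream matching elements into SQLite, committing once per batch."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS record "
                "(id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("CREATE TABLE IF NOT EXISTS progress "
                "(id INTEGER PRIMARY KEY CHECK (id = 1), done INTEGER)")
    cur.execute("INSERT OR IGNORE INTO progress (id, done) VALUES (1, 0)")
    conn.commit()
    # Records committed by an earlier run are skipped on restart.
    already_done = cur.execute("SELECT done FROM progress").fetchone()[0]
    seen = 0
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != tag:
            continue
        seen += 1
        if seen > already_done:
            cur.execute("INSERT INTO record (name) VALUES (?)",
                        (elem.get("name"),))
            if seen % batch_size == 0:
                # The checkpoint shares a transaction with the inserts,
                # so the two cannot drift apart after a crash.
                cur.execute("UPDATE progress SET done = ?", (seen,))
                conn.commit()
        elem.clear()  # release the element's content to limit memory use
    cur.execute("UPDATE progress SET done = ?", (max(seen, already_done),))
    conn.commit()
    return max(seen - already_done, 0)
```

Running the function twice on the same input inserts the records once; the second call finds the checkpoint and skips them.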
- class clinvar_build.parser.ClinVarParser(config: dict[str, Any] | None = None)[source]
Parser for ClinVar XML files.
- Parameters:
config (dict[str, Any]) – A dictionary with instructions to parse an XML file into a SQLite database.
- conn
Active SQLite connection, or None if not connected.
- Type:
sqlite3.Connection or None
- cursor
Cursor for executing SQL statements, or None if not connected.
- Type:
sqlite3.Cursor or None
- stats
Statistics on parsed records
- Type:
dict [str, int]
- Class Attributes
- _SQL_COLS_PATTERN
Compiled regex to extract column names from SQL INSERT statements
- Type:
- parse_file(xml_path: str | Path, db_path: str | Path, batch_size: int = 10, validate_foreign_keys: bool = True, enforce_foreign_keys: bool = False, xsd_path: str | Path | None = None, xsd_strict: bool = False, resume: bool = True, count_duplicates: bool = False) None[source]
Parse ClinVar XML file into SQLite database.
- Parameters:
xml_path (str or Path) – Path to ClinVar XML file
db_path (str or Path) – Path to SQLite database file
batch_size (int, default 10) – Number of top-level records (including children) to commit at once.
validate_foreign_keys (bool, default True) – Whether to validate foreign key integrity after parsing
enforce_foreign_keys (bool, default False) – Whether to enforce foreign keys during parsing (slower but catches errors immediately). If False, foreign keys are only validated after parsing completes.
xsd_path (str, Path, or None, default None) – Optional path to XSD schema for XML validation
xsd_strict (bool, default False) – If False, allows XML elements not defined in XSD
resume (bool, default True) – If True, attempt to resume from last checkpoint. If no checkpoint exists, starts fresh. If False, always starts fresh (existing data may cause constraint violations)
count_duplicates (bool, default False) – After building the database print potential duplicated rows per table
- clinvar_build.parser.main()[source]
Main entry point for ClinVar XML parser.
Parses command-line arguments, loads configuration, initializes parser, and processes ClinVar XML file into SQLite database.
- clinvar_build.parser.parse_arguments()[source]
Parse command-line arguments for database parsing.
- Returns:
Parsed command-line arguments containing:
db : Database file path
xml : XML file path
xsd : XSD schema file path (optional)
rcv : Whether to create RCV schema
vcv : Whether to create VCV schema
batch_size : Number of records to commit at once
enforce_fk : Whether to enforce foreign keys during parsing
no_validation : Whether to skip post-build validation
count_duplicates : Whether to count duplicate rows
verbose : Verbosity level (0-3)
- Return type:
General utils
General utility functions for the clinvar-build module
- clinvar_build.utils.general.assign_empty_default(arguments: list[Any], empty_object: Callable[[], Any]) list[Any][source]
Takes a list of arguments, checks whether any are None, and replaces those with a new object returned by 'empty_object'.
- Parameters:
arguments (list [any]) – A list of arguments which may be set to NoneType.
empty_object (Callable) – A function that returns a mutable object. Examples include list or dict.
- Returns:
new_arguments – List with NoneType replaced by empty mutable object.
- Return type:
list [any]
Examples
>>> assign_empty_default(['hi', None, 'hello'], empty_object=list)
['hi', [], 'hello']
Notes
This function helps deal with the pitfall of assigning an empty mutable object as a default function argument, which would persist through multiple function calls, leading to unexpected/undesired behaviours.
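The pitfall described in the note, and one way a helper of this kind can avoid it, is sketched below. `assign_empty_default_sketch` is a hypothetical reimplementation written for illustration, not the module's source, and the two `append_*` functions exist only to contrast the behaviours.

```python
from typing import Any, Callable


def assign_empty_default_sketch(
    arguments: list[Any], empty_object: Callable[[], Any]
) -> list[Any]:
    """Replace each None with a fresh object from empty_object()."""
    return [empty_object() if arg is None else arg for arg in arguments]


def append_shared(item, bucket=[]):
    # Pitfall: the default list is created once and reused across calls.
    bucket.append(item)
    return bucket


def append_fresh(item, bucket=None):
    # Fix: default to None and swap in a new list on every call.
    (bucket,) = assign_empty_default_sketch([bucket], list)
    bucket.append(item)
    return bucket
```

Calling `append_shared` twice accumulates both items in one list, whereas `append_fresh` starts from an empty list each time.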
Configuration tools
Configuration parsing and XML validation utilities for ClinVar Build.
This module provides utilities for parsing configuration files and managing logging output for long-running operations. It includes classes for handling block-based configuration files and property management with controlled access.
- class clinvar_build.utils.config_tools.BlockConfigParser(path: str)[source]
Parses configuration files with headers and concatenated content, essentially parsing a block of text to a single string.
This parser identifies sections marked by headers wrapped in square brackets and appends all subsequent lines (until the next header) into a single string for each section.
- Parameters:
path (str or Path) – Path to the configuration file to be parsed.
- parsed_data
The parsed data.
- Type:
dict [str, str]
Examples
Given a configuration file:
[header1]
This is line one
This is line two
[header2]
Another section
With multiple lines
The parser will create:
{
'header1': 'This is line one\nThis is line two', 'header2': 'Another section\nWith multiple lines'
}
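The header-and-block behaviour in the example above can be approximated in a few lines. This is a sketch under assumptions: `parse_blocks` is a hypothetical function that works on a string instead of a file path, and dropping blank lines is a choice made here, not documented behaviour of `BlockConfigParser`.

```python
import re

# A header is a line consisting only of a name in square brackets.
_HEADER = re.compile(r"^\[(?P<name>[^\[\]]+)\]\s*$")


def parse_blocks(text: str, sep: str = "\n") -> dict[str, str]:
    """Group lines under the most recent [header] and join them with sep."""
    blocks: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        match = _HEADER.match(stripped)
        if match:
            current = match.group("name")
            blocks[current] = []
        elif current is not None and stripped:
            # Blank lines are skipped (an assumption of this sketch).
            blocks[current].append(stripped)
    return {name: sep.join(lines) for name, lines in blocks.items()}
```

Passing a different `sep`, such as a single space, would collapse each block onto one line.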
- __call__(section: str | None = None, sep='\n') Self[source]
Parse the configuration file into sections with concatenated content.
- Parameters:
section (str or None, default None) – The specific section to parse from the configuration file. If provided, only content within the matching section delimiters ([[section]]) will be processed. If None (default), all sections are parsed.
sep (str, default '\n') – The separator to indicate a new line.
- Returns:
The parser instance with populated data.
- Return type:
Self
- class clinvar_build.utils.config_tools.ManagedProperty(name: str, types: tuple[type] | type | None = None)[source]
A generic property factory defining setters and getters, with optional type validation.
- Parameters:
name (str) – The name of the setters and getters
types (type, tuple of types, or None, default None) – Either a single type or a tuple of types to validate against.
- set_with_setter(instance, value)[source]
Enables the setter, sets the property value, and then disables the setter, ensuring controlled updates.
- Returns:
A property object with getter and setter.
- Return type:
- class clinvar_build.utils.config_tools.ProgressHandler(stream=None)[source]
Custom handler that updates progress in place.
Uses ANSI escape codes to overwrite previous output instead of printing new lines. Useful for progress updates during long-running operations.
- Parameters:
stream (file-like object, optional) – Output stream. Defaults to sys.stdout.
- _last_line_count
Number of lines in the previous message, used to calculate how far to move the cursor up.
- Type:
int
Examples
>>> progress_logger = logging.getLogger('progress')
>>> handler = ProgressHandler()
>>> handler.setFormatter(logging.Formatter('%(asctime)s - %(message)s'))
>>> progress_logger.addHandler(handler)
>>> progress_logger.info("Processing: 100 records")
>>> progress_logger.info("Processing: 200 records")  # Overwrites previous
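One way such in-place updates can be produced is sketched below. `InPlaceHandler` is a hypothetical class, and the specific escape sequences (cursor to previous line, then erase to end of screen) are an assumption about how the overwrite could be done, not a statement of what `ProgressHandler` emits.

```python
import io
import logging
import sys


class InPlaceHandler(logging.StreamHandler):
    """Hypothetical handler that redraws its last message in place."""

    def __init__(self, stream=None):
        # Default to stdout, as the documented handler does.
        super().__init__(sys.stdout if stream is None else stream)
        self._last_line_count = 0

    def emit(self, record):
        message = self.format(record)
        if self._last_line_count:
            # ESC[<n>F moves the cursor up n lines to the line start;
            # ESC[J erases from the cursor to the end of the screen.
            self.stream.write(f"\x1b[{self._last_line_count}F\x1b[J")
        self.stream.write(message + "\n")
        self.flush()
        # Remember how many lines to move up before the next message.
        self._last_line_count = message.count("\n") + 1
```

Writing to an `io.StringIO` stream instead of a terminal makes the emitted escape sequences visible for inspection.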
- clinvar_build.utils.config_tools.check_environ(environ_variable: str = 'CLINVAR_BUILD_CONFIG', fall_back: str | Path | None = PosixPath('/usr/local/share/clinvar_build/clinvar_config')) str[source]
Retrieve an environment variable pointing to a directory path, with optional fallback path.
Attempts to retrieve the specified environment variable. If the variable is not set, the function will attempt to use the fallback path if provided. This is useful for configuration management where environment variables may not always be explicitly set.
- Parameters:
environ_variable (str) – The name of the environment variable to retrieve.
fall_back (str, Path or None) – A fallback path to return if the environment variable is not set. If None, an error will be raised when the environment variable is missing.
- Returns:
A directory path.
- Return type:
str
- Raises:
Notes
The function does not check whether the path exists or whether permissions allow read or write access.
Warning
- UserWarning
Issued when the environment variable is not set and the fallback path is used instead.
Examples
>>> check_environ("MY_VAR")
'/path/to/default/config'
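The lookup-then-fallback logic can be sketched as follows. `check_environ_sketch` is a hypothetical stand-in, and raising `KeyError` when neither the variable nor a fallback is available is an assumption, since the exception type is not stated above.

```python
import os
import warnings


def check_environ_sketch(environ_variable="CLINVAR_BUILD_CONFIG",
                         fall_back=None):
    """Return the variable's value, or the fallback path with a warning."""
    value = os.environ.get(environ_variable)
    if value is not None:
        return value
    if fall_back is None:
        # Assumed exception type; the documentation does not name it.
        raise KeyError(
            f"{environ_variable} is not set and no fallback was given"
        )
    warnings.warn(
        f"{environ_variable} is not set; using fallback {fall_back}",
        UserWarning,
    )
    # No existence or permission check is made on the returned path.
    return str(fall_back)
```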
Parser tools
XML parsing and SQLite database utilities for ClinVar Build.
This module provides classes and functions for parsing large ClinVar XML files and loading them into SQLite databases. It includes utilities for database validation, progress tracking, error formatting, and XML inspection.
- class clinvar_build.utils.parser_tools.SQLiteParser(config: dict[str, Any] | None = None)[source]
A general SQLite parser.
- Parameters:
config (dict[str, Any]) – A dictionary with instructions to parse an XML file into a SQLite database.
- conn
Active SQLite connection, or None if not connected.
- Type:
sqlite3.Connection or None
- cursor
Cursor for executing SQL statements, or None if not connected.
- Type:
sqlite3.Cursor or None
- stats
Statistics on parsed records.
- Type:
dict [str, int]
- config
A configuration dictionary with parsing instructions.
- Type:
dict [str, any]
- __repr__() str[source]
Return unambiguous string representation.
- Returns:
String representation suitable for debugging
- Return type:
str
- __str__() str[source]
Return human-readable string representation.
- Returns:
Human-readable description of the parser
- Return type:
str
- count_duplicates()[source]
Count duplicate rows in all tables based on non-id columns.
For each table in the configuration, identifies duplicate rows where all columns (except the primary key ‘id’) are identical, treating NULL values as equal.
Notes
Uses SQLite’s rowid to identify duplicates, counting rows that would be removed (keeping the row with lowest rowid). NULL values are treated as equal via GROUP BY.
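The counting rule (rows that would be removed if one row per group were kept) can be expressed with a plain GROUP BY. The sketch below is a simplified variant that sums group sizes instead of comparing rowid values; `count_duplicates_sketch` and the `gene` table in the usage note are hypothetical, and the table and column names are interpolated directly, so they must come from a trusted source.

```python
import sqlite3


def count_duplicates_sketch(conn, table, columns):
    """Count rows that repeat an earlier row on every listed column."""
    cols = ", ".join(columns)
    # GROUP BY places NULLs in the same group, so NULL matches NULL here.
    query = (
        f"SELECT COALESCE(SUM(n - 1), 0) FROM "
        f"(SELECT COUNT(*) AS n FROM {table} GROUP BY {cols})"
    )
    return conn.execute(query).fetchone()[0]
```

For a hypothetical `gene` table with columns `symbol` and `source`, `count_duplicates_sketch(conn, "gene", ["symbol", "source"])` returns the number of surplus rows once the `id` column is ignored.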
- validate_database() dict[str, Any][source]
Run comprehensive database validation checks.
Performs multiple validation checks including foreign key integrity, table row counts, and basic statistics.
- Returns:
Dictionary containing validation results and statistics
- Return type:
dict [str, any]
- Raises:
ValueError – If foreign key violations are found
Examples
>>> with parser._connection(db_path):
...     results = parser.validate_database()
>>> print(results)
{'foreign_keys': 'valid', 'total_records': 12345, ...}
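A foreign key check combined with row counts can be sketched using SQLite's built-in `PRAGMA foreign_key_check`. `validate_sketch` is hypothetical; the `foreign_keys` and `total_records` keys follow the example output above, while the `tables` key is an assumption added for illustration.

```python
import sqlite3


def validate_sketch(conn):
    """Check foreign keys and collect per-table row counts."""
    # The pragma returns one row per violating record, none if all is well.
    violations = conn.execute("PRAGMA foreign_key_check").fetchall()
    if violations:
        raise ValueError(
            f"{len(violations)} foreign key violation(s) found"
        )
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )]
    counts = {
        t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
        for t in tables
    }
    return {
        "foreign_keys": "valid",
        "total_records": sum(counts.values()),
        "tables": counts,
    }
```

The pragma reports violations whether or not foreign key enforcement was switched on while the rows were inserted, which suits a validate-after-parsing workflow.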
- class clinvar_build.utils.parser_tools.ViewXML(xml_path: str | Path, tag_name: str, index: int = 0)[source]
Class to view and inspect XML records.
This class loads a single XML element and provides utilities for inspecting and navigating its structure, particularly useful for large ClinVar XML files.
- Parameters:
xml_path (str or Path) – Path to ClinVar XML file (supports .gz compression)
tag_name (str) – XML tag name to search for (e.g., ‘VariationArchive’)
index (int, default 0) – Which occurrence of the tag to load (0-indexed)
- xml_path
Path to the source XML file
- Type:
Path
- tag_name
The tag name that was searched for
- Type:
str
- index
The index of the loaded element
- Type:
int
- element
The loaded XML element, or None if not found
- Type:
ET.Element or None
Examples
>>> viewer = ViewXML('clinvar.xml.gz', 'VariationArchive', index=5)
>>> viewer.show_tree()
>>> paths = viewer.find_all_paths()
- __init__(xml_path: str | Path, tag_name: str, index: int = 0)[source]
Initialise ViewXML with a single XML element.
- __repr__() str[source]
Return unambiguous string representation.
- Returns:
String that could recreate the object
- Return type:
str
- __str__() str[source]
Return human-readable string representation.
- Returns:
Summary of the loaded element
- Return type:
str
- count_all_tags() dict[str, int][source]
Count all descendant tags in loaded element tree.
- Returns:
Dictionary mapping tag names to their occurrence counts, ordered by frequency (most common first)
- Return type:
dict of {str: int}
- Raises:
ValueError – If no element is loaded
- find_all_paths() list[str][source]
Find all unique XPath-like paths in the loaded element tree.
- Returns:
Sorted list of all unique paths found in the tree
- Return type:
list [str]
- Raises:
ValueError – If no element is loaded
Examples
>>> viewer = ViewXML('clinvar.xml.gz', 'VariationArchive')
>>> paths = viewer.find_all_paths()
>>> print(paths[:5])
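Path collection and tag counting over an already loaded element can be sketched with `xml.etree.ElementTree`. Both functions below are hypothetical stand-ins that operate on an element directly; the leading-slash path format is an assumption, since the exact output format of `find_all_paths` is not shown above.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_all_paths_sketch(element):
    """Return the sorted set of slash-separated tag paths in the tree."""
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(element, "")
    return sorted(paths)


def count_all_tags_sketch(element):
    """Count descendant tags, most common first."""
    # iter() yields the element itself first, so skip it.
    counts = Counter(node.tag for node in list(element.iter())[1:])
    return dict(counts.most_common())
```

Repeated siblings collapse into a single path but still contribute individually to the tag counts.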
- show_children(indent: int = 0) None[source]
Display immediate children of loaded element with summary info.
- Parameters:
indent (int, default 0) – Indentation level for output formatting
- Raises:
ValueError – If no element is loaded
- show_tree(max_depth=4, indent=0)[source]
Display element tree structure up to specified depth.
- Parameters:
max_depth (int, default 4) – Maximum depth to display
indent (int, default 0) – Starting indentation level
- Raises:
ValueError – If no element is loaded
- clinvar_build.utils.parser_tools.configure_logging(verbosity: int) None[source]
Configure logging based on verbosity level.
This function configures the root logger, which affects all loggers in the application through inheritance.
- Parameters:
verbosity (int) – Number of -v flags. 0 = WARNING, 1 = INFO, 2 = DEBUG, 3 = TRACE
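The verbosity mapping can be sketched as a lookup table plus a custom level registered with the standard `logging` module. The numeric value 5 for TRACE and the clamping of out-of-range counts are assumptions made for this sketch, and the function names are hypothetical.

```python
import logging

TRACE = 5  # assumed numeric value, below DEBUG (10)
logging.addLevelName(TRACE, "TRACE")

_LEVELS = {0: logging.WARNING, 1: logging.INFO, 2: logging.DEBUG, 3: TRACE}


def verbosity_to_level(verbosity: int) -> int:
    """Clamp the -v count into the 0-3 range and look up its level."""
    return _LEVELS[max(0, min(verbosity, 3))]


def configure_logging_sketch(verbosity: int) -> None:
    # Setting the root logger's level affects loggers that inherit from it.
    logging.getLogger().setLevel(verbosity_to_level(verbosity))
```

With this mapping, `-vv` on the command line would correspond to `configure_logging_sketch(2)` and enable DEBUG output.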
- clinvar_build.utils.parser_tools.open_xml_file(xml_path: str | Path, xsd_path: str | Path | None = None, strict: bool = True, verbose: bool = True)[source]
Helper function to open an XML file, handling compression based on extension.
Automatically closes the file handle when exiting the context, even if an exception occurs.
- Parameters:
xml_path (str or Path) – The path to the file.
xsd_path (str, Path, or None, default None) – Optional path to XSD schema file for validation. If provided, validates XML before yielding the file handle.
strict (bool, default True) – If False, ignores elements in the XML that are not in the XSD. Only used when xsd_path is provided.
verbose (bool, default True) – If True, warns about validation issues when strict=False. Only used when xsd_path is provided.
- Yields:
file-like object – An open file handle for the compressed or uncompressed XML.
- Raises:
FileNotFoundError – If xml_path or xsd_path does not exist
IOError – If file cannot be opened
XMLValidationError – If XSD validation fails (when xsd_path is provided)
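The extension-based handling of compression can be sketched as a small context manager. `open_xml_sketch` is hypothetical and leaves out XSD validation entirely; treating only the `.gz` suffix as compressed is an assumption of this sketch.

```python
import gzip
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def open_xml_sketch(xml_path):
    """Yield a binary file handle, using gzip for .gz files."""
    path = Path(xml_path)
    if not path.exists():
        raise FileNotFoundError(path)
    # Pick the opener from the file extension.
    opener = gzip.open if path.suffix == ".gz" else open
    handle = opener(path, "rb")
    try:
        yield handle
    finally:
        # Runs on normal exit and when the with-body raises.
        handle.close()
```

Used as `with open_xml_sketch(path) as handle:`, the handle can be passed straight to an incremental XML parser.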
- clinvar_build.utils.parser_tools.trace(self, message, *args, **kwargs)[source]
Custom trace method for detailed logging.
- clinvar_build.utils.parser_tools.validate_xml(xml_path: str | Path, xsd_path: str | Path, strict: bool = True, verbose: bool = True) _ElementTree[source]
Validates an XML file against an XSD schema.
- Parameters:
xml_path (str or Path) – Path to the XML file.
xsd_path (str or Path) – Path to the XSD file.
strict (bool, default True) – If False, ignores elements in the XML that are not in the XSD.
- Returns:
The parsed XML document.
- Return type:
etree._ElementTree
- Raises:
XMLValidationError – Raised if the XSD and XML are incompatible.