phenopacket_mapper.data_standards package
This submodule defines the data standards used in the project.
- class phenopacket_mapper.data_standards.Cardinality(min: int = 0, max: int | Literal['n'] = 'n')[source]
Bases:
object
- class property ZERO_TO_ONE: Cardinality[source]
- Type:
Cardinality(min: int = 0, max: Union[int, Literal['n']] = 'n')
- class property ZERO_TO_N: Cardinality[source]
- Type:
Cardinality(min: int = 0, max: Union[int, Literal['n']] = 'n')
- class property ONE: Cardinality[source]
- Type:
Cardinality(min: int = 0, max: Union[int, Literal['n']] = 'n')
- class property ONE_TO_N: Cardinality[source]
- Type:
Cardinality(min: int = 0, max: Union[int, Literal['n']] = 'n')
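The four class properties above are named after common cardinality ranges. As a minimal sketch of how a (min, max) pair with 'n' as an unbounded maximum might be checked against a number of values, the following uses a stand-in dataclass rather than the library's class. The name `CardinalitySketch`, the `allows` helper, and the concrete bounds assigned to the four constants are assumptions inferred from the property names, not taken from the library.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class CardinalitySketch:
    """Stand-in for illustration only; not the library's Cardinality class."""
    min: int = 0
    max: Union[int, Literal['n']] = 'n'

    def allows(self, count: int) -> bool:
        """Hypothetical helper: is `count` within [min, max]? 'n' means unbounded."""
        if count < self.min:
            return False
        return self.max == 'n' or count <= self.max


# Assumed bounds, inferred from the property names in the reference above.
ZERO_TO_ONE = CardinalitySketch(0, 1)
ZERO_TO_N = CardinalitySketch(0, 'n')
ONE = CardinalitySketch(1, 1)
ONE_TO_N = CardinalitySketch(1, 'n')

print(ZERO_TO_ONE.allows(2))  # False
print(ONE_TO_N.allows(5))     # True
```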
- class phenopacket_mapper.data_standards.Coding(system: str | CodeSystem, code: str, display: str = '', text: str = '')[source]
Bases:
object
Data class for Coding
A Coding is a representation of a concept defined by a code and a code system. It is used in the CodeableConcept data class.
- Variables:
system – The code system that defines the code
code – The code that represents the concept
display – The human readable representation of the concept
text – A human readable description or other additional text of the concept
- system: str | CodeSystem
- static parse_coding(coding_str: str, resources: List[CodeSystem], compliance: Literal['lenient', 'strict'] = 'lenient') Coding [source]
Parses a string representing a coding into a Coding object
Expected format: <namespace_prefix>:<code>
E.g.:
>>> Coding.parse_coding("SNOMED:404684003", [code_system.SNOMED_CT])
Coding(system=CodeSystem(name=SNOMED CT, name space prefix=SNOMED, version=0.0.0), code='404684003', display='', text='')
Intended to be called with a list of all resources used.
Only namespace prefixes that belong to code systems provided in the resources list can be recognized. If a namespace prefix is not found in the resources, the method returns a Coding object with the namespace prefix as the system and the code as the code.
E.g.:
>>> Coding.parse_coding("SNOMED:404684003", [])
Warning: Code system with namespace prefix 'SNOMED' not found in resources.
Warning: Returning Coding object with system as namespace prefix and code as '404684003'
Coding(system='SNOMED', code='404684003', display='', text='')
- Parameters:
coding_str – a string representing a coding
resources – a list of all resources used
compliance – whether to raise a ValueError ('strict') or only issue a warning ('lenient') if a namespace prefix is not found in the resources
- Returns:
a Coding object as specified in the coding string
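The lookup described for parse_coding can be sketched without the library. The following is an illustration under stated assumptions, not the library's implementation: `CodeSystemSketch`, `CodingSketch`, and `parse_coding_sketch` are hypothetical stand-ins, and raising a ValueError for a string without a colon is an assumption of this sketch.

```python
import warnings
from dataclasses import dataclass
from typing import List, Literal, Union


@dataclass(frozen=True)
class CodeSystemSketch:
    """Stand-in for CodeSystem; only the fields needed here."""
    name: str
    namespace_prefix: str


@dataclass(frozen=True)
class CodingSketch:
    """Stand-in for Coding."""
    system: Union[str, CodeSystemSketch]
    code: str


def parse_coding_sketch(
    coding_str: str,
    resources: List[CodeSystemSketch],
    compliance: Literal['lenient', 'strict'] = 'lenient',
) -> CodingSketch:
    # Split "<namespace_prefix>:<code>" at the first colon.
    prefix, sep, code = coding_str.partition(':')
    if not sep:
        # Assumption of this sketch: a string without a colon is rejected.
        raise ValueError(f"Expected '<namespace_prefix>:<code>', got {coding_str!r}")
    # Look for a code system whose namespace prefix matches.
    for code_system in resources:
        if code_system.namespace_prefix == prefix:
            return CodingSketch(system=code_system, code=code)
    message = f"Code system with namespace prefix {prefix!r} not found in resources."
    if compliance == 'strict':
        raise ValueError(message)
    warnings.warn(message)
    # Lenient fallback: keep the bare prefix string as the system.
    return CodingSketch(system=prefix, code=code)


snomed = CodeSystemSketch(name="SNOMED CT", namespace_prefix="SNOMED")
known = parse_coding_sketch("SNOMED:404684003", [snomed])
unknown = parse_coding_sketch("SNOMED:404684003", [])  # emits a warning
```

Passing the full list of code systems used in a project keeps the fallback branch rare, which matches the advice to call the method with all resources.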
- class phenopacket_mapper.data_standards.CodeableConcept(coding: List[Coding], text: str = '')[source]
Bases:
object
Data class for CodeableConcept
A CodeableConcept represents a concept that is defined by a set of codes. The concept may additionally have a text representation.
- Variables:
coding – A list of codings that define the concept
text – A text representation of the concept
- class phenopacket_mapper.data_standards.DataModel(name: str, fields: Tuple[DataField | DataSection | OrGroup, ...], id: str = None, resources: Tuple[CodeSystem, ...] = <factory>)[source]
Bases:
object
This class defines a data model for medical data using DataField
A data model can be used to import data and map it to the Phenopacket schema. It is made up of a list of DataField objects.
Given that all DataField objects in a DataModel have unique names, the id field is generated from the name. E.g.: DataField(name='Date of Birth', …) will have an id of 'date_of_birth'. The DataField objects can be accessed using the id as an attribute of the DataModel object, e.g. data_model.date_of_birth. This is useful in the data reading and mapping processes.
- Variables:
name – Name of the data model
fields – List of DataField objects
resources – List of CodeSystem objects
- fields: Tuple[DataField | DataSection | OrGroup, ...]
- resources: Tuple[CodeSystem, ...]
- get_field(field_id: str, default: Optional = None) DataField | None [source]
Returns a DataField object by its id
- Parameters:
field_id – The id of the field
default – The default value to return if the field is not found
- Returns:
The DataField object
- load_data(path: str | Path, compliance: Literal['lenient', 'strict'] = 'lenient', **kwargs) DataSet [source]
Loads data from a file using a DataModel definition
To call this method, pass the column name for each field in the DataModel as a keyword argument. This is done by passing the field id followed by ‘_column’. E.g. if the DataModel has a field with id ‘date_of_birth’, the column name in the file should be passed as ‘date_of_birth_column’. The method will raise an error if any of the fields are missing.
E.g.:
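```python
from phenopacket_mapper.data_standards import DataField, DataModel, ValueSet

# The field id "field_1" is generated from the name "Field 1".
data_model = DataModel("Test data model", [DataField(name="Field 1", specification=ValueSet())])
data_model.load_data("data.csv", field_1_column="column_name_in_file")
```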
- Parameters:
path – Path to the file containing the data
compliance – Compliance level to use when loading the data.
kwargs – Dynamically passed parameters that match {id}_column for each item
- Returns:
A DataSet containing the loaded DataModelInstance objects
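The `{id}_column` keyword convention described above can be illustrated independently of the library. This is a minimal sketch under assumptions: `resolve_columns` and `load_rows` are hypothetical helpers, the data is plain CSV text, and a ValueError stands in for whatever error the library raises when a field is missing.

```python
import csv
import io
from typing import Dict, List


def resolve_columns(field_ids: List[str], **kwargs: str) -> Dict[str, str]:
    """Map each field id to the column name passed as '<field_id>_column'."""
    columns = {}
    for field_id in field_ids:
        key = f"{field_id}_column"
        if key not in kwargs:
            # Assumption: a missing column mapping is reported as a ValueError.
            raise ValueError(f"Missing keyword argument {key!r}")
        columns[field_id] = kwargs[key]
    return columns


def load_rows(text: str, field_ids: List[str], **kwargs: str) -> List[Dict[str, str]]:
    """Read CSV text and rename the selected columns to field ids."""
    columns = resolve_columns(field_ids, **kwargs)
    reader = csv.DictReader(io.StringIO(text))
    return [{fid: row[col] for fid, col in columns.items()} for row in reader]


csv_text = "DOB,Sex\n1990-01-31,F\n"
rows = load_rows(csv_text, ["date_of_birth", "sex"],
                 date_of_birth_column="DOB", sex_column="Sex")
print(rows)  # [{'date_of_birth': '1990-01-31', 'sex': 'F'}]
```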
- class phenopacket_mapper.data_standards.DataField(name: str, specification: ValueSet | type | List[type], id: str = None, required: bool = False, description: str = '', cardinality: Cardinality = <factory>)[source]
Bases:
DataNode
This class defines fields used in the definition of a DataModel
A data field is the equivalent of a column in a table. It has a name, a specification (a value set or type), an id, a description, a required flag, and a cardinality.
The string for the id field is generated from the name field using the str_to_valid_id function from the phenopacket_mapper.utils module. This attempts to convert the name field. Sometimes this might not work as desired, in which case the id field can be set manually.
Naming rules for the id field:
- The id field must be a valid Python identifier
- The id field must start with a letter or the underscore character
- The id field cannot start with a number
- The id field can only contain lowercase alphanumeric characters and underscores (a-z, 0-9, and _)
- The id field cannot be any of the Python keywords (e.g. in, is, not, class, etc.)
- The id field must be unique within a DataModel
If the value set is a single type, that type can be passed directly as the specification parameter.
- Variables:
name – Name of the field
specification – Value set of the field; if the value set consists of a single type, that type can be passed directly
id – The identifier of the field, adhering to the naming rules stated above
description – Description of the field
required – Required flag of the field
- cardinality: Cardinality
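The naming rules for the id field can be made concrete with a small sketch. The function below is a hypothetical stand-in, not the library's str_to_valid_id, whose behavior may differ; it only shows one way a name could be turned into an id that satisfies the rules listed above (uniqueness within a DataModel is not checked here).

```python
import keyword
import re


def name_to_id_sketch(name: str) -> str:
    """Hypothetical stand-in for str_to_valid_id; the real function may differ."""
    # Lowercase, then replace every run of disallowed characters with one underscore.
    candidate = re.sub(r'[^a-z0-9_]+', '_', name.strip().lower()).strip('_')
    # An identifier cannot be empty or start with a digit.
    if not candidate or candidate[0].isdigit():
        candidate = '_' + candidate
    # Python keywords such as 'class' are not allowed as ids.
    if keyword.iskeyword(candidate):
        candidate += '_'
    return candidate


print(name_to_id_sketch("Date of Birth"))  # date_of_birth
print(name_to_id_sketch("1st visit"))      # _1st_visit
print(name_to_id_sketch("Class"))          # class_
```

When a generated id is not what is wanted, the id can be set manually, as noted in the class description.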
- class phenopacket_mapper.data_standards.DataModelInstance(id: int | str, data_model: DataModel, values: Tuple[DataFieldValue | DataSectionInstance, ...], compliance: Literal['lenient', 'strict'] = 'lenient')[source]
Bases:
object
This class defines an instance of a DataModel, i.e. a record or row in a dataset.
- Variables:
id – The id of the instance, i.e. the row number
data_model – The DataModel object that defines the data model for this instance
values – A list of DataFieldValue objects, each adhering to the DataField definition in the DataModel
compliance – Compliance level to enforce when validating the instance. If ‘lenient’, the instance can have extra fields that are not in the DataModel. If ‘strict’, the instance must have all fields in the DataModel.
- values: Tuple[DataFieldValue | DataSectionInstance, ...]
- validate() bool [source]
Validates the data model instance based on data model definition
This method checks if the instance is valid based on the data model definition. It checks if all required fields are present, if the values are in the value set, etc.
- Returns:
True if the instance is valid, False otherwise
- class phenopacket_mapper.data_standards.DataFieldValue(id: str | int, field: DataField, value: int | float | str | bool | Date | CodeSystem)[source]
Bases:
object
This class defines the value of a DataField in a DataModelInstance
Equivalent to a cell value in a table.
- Variables:
id – The id of the value, i.e. the row number
field – DataField: The DataField to which this value belongs and which defines the value set for the field.
value – The value of the field.
- validate() bool [source]
Validates the data model instance based on data model definition
This method checks if the instance is valid based on the data model definition. It checks if all required fields are present, if the values are in the value set, etc.
- Returns:
True if the instance is valid, False otherwise
- class phenopacket_mapper.data_standards.DataSet(data_model: DataModel, data: List[DataModelInstance])[source]
Bases:
object
This class defines a dataset as defined by a DataModel. It is a collection of DataModelInstance objects.
- Variables:
data_model – The DataModel object that defines the data model for this dataset
data – A list of DataModelInstance objects, each adhering to the DataField definition in the DataModel
- data: List[DataModelInstance]
- property height
- property width
- preprocess(fields: str | DataField | List[str | DataField], mapping: Dict | Callable, **kwargs)[source]
Preprocesses a field in the dataset
Preprocessing happens in place, i.e. the values in the dataset are modified directly.
If fields is a list of fields, the mapping must be a method that can handle a list of values being passed as value to it. E.g.:
```python
def preprocess_method(values, method, **kwargs):
    field1, field2 = values
    # do something with values
    return "preprocessed_values" + kwargs["arg1"] + kwargs["arg2"]

dataset.preprocess(["field_1", "field_2"], preprocess_method, arg1="value1", arg2="value2")
```
- Parameters:
fields – Data fields to be preprocessed, will be passed onto mapping
mapping – A dictionary or method to use for preprocessing
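To illustrate what in-place preprocessing with either a dictionary or a callable can look like, here is a simplified sketch that works on plain dictionaries instead of a DataSet. `preprocess_sketch` is a hypothetical helper that handles a single field only, and leaving values that are missing from the dictionary unchanged is an assumption of this sketch, not documented library behavior.

```python
from typing import Any, Callable, Dict, List, Union


def preprocess_sketch(
    rows: List[Dict[str, Any]],
    field: str,
    mapping: Union[Dict[Any, Any], Callable[..., Any]],
    **kwargs: Any,
) -> None:
    """Apply `mapping` to one field of every row, modifying the rows in place."""
    for row in rows:
        value = row[field]
        if callable(mapping):
            # Extra keyword arguments are forwarded to the callable.
            row[field] = mapping(value, **kwargs)
        else:
            # Assumption: values missing from the dictionary are left unchanged.
            row[field] = mapping.get(value, value)


rows = [{"sex": "f", "age": "42"}, {"sex": "m", "age": "7"}]
preprocess_sketch(rows, "sex", {"f": "female", "m": "male"})
preprocess_sketch(rows, "age", lambda v, offset=0: int(v) + offset, offset=1)
print(rows)  # [{'sex': 'female', 'age': 43}, {'sex': 'male', 'age': 8}]
```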
- class phenopacket_mapper.data_standards.DataSection(name: str, id: str = None, fields: Tuple[DataField | DataSection | OrGroup, ...] = <factory>, required: bool = False, cardinality: Cardinality = <factory>)[source]
Bases:
object
This class defines a section in a DataModel
A section is a collection of DataField or DataSection objects. It is used to group related fields in a DataModel.
- Variables:
name – Name of the section
fields – List of DataField objects
- fields: Tuple[DataField | DataSection | OrGroup, ...]
- cardinality: Cardinality
- class phenopacket_mapper.data_standards.OrGroup(fields: Tuple[DataField | DataSection | OrGroup, ...], name: str = 'Or Group', id: str = None, description: str = '', required: bool = False, cardinality: Cardinality = Cardinality(min=0, max='n'))[source]
Bases:
DataNode
- fields: Tuple[DataField | DataSection | OrGroup, ...]
- cardinality: Cardinality
- class phenopacket_mapper.data_standards.CodeSystem(name: str, namespace_prefix: str, url: str = None, iri_prefix: str = None, version: str = '0.0.0', synonyms: List[str] = <factory>)[source]
Bases:
object
Data class for a CodeSystem
A CodeSystem is a resource that defines a set of codes and their meanings. It could be a terminology, an ontology, a nomenclature, etc. Popular examples include SNOMED CT, HPO, MONDO, OMIM, ORDO, LOINC, etc.
This class is necessary to fill the resources parameter in the Phenopacket later.
- Variables:
name – The name of the CodeSystem
namespace_prefix – The namespace prefix of the CodeSystem
url – The URL of the CodeSystem
iri_prefix – The IRI prefix of the CodeSystem
version – The version of the CodeSystem
synonyms – Typical alternative abbreviations or names for the resource, to better parse its usage (e.g. 'HPO' for the Human Phenotype Ontology, even if its namespace prefix is commonly 'HP')
- version: str
- set_version(value) CodeSystem [source]
- class phenopacket_mapper.data_standards.Date(year: int = 0, month: int = 0, day: int = 0, hour: int = 0, minute: int = 0, second: int = 0)[source]
Bases:
object
Data class for Date
This class defines a date object with many useful utility functions, especially for conversions from and to specific string formats.
- Variables:
year – the year of the date
month – the month of the date
day – the day of the date
hour – the hour of the date
minute – the minute of the date
second – the second of the date
- iso_8601_datestring(allow_zeros: bool = True) str [source]
Returns the date in ISO 8601 format
Example: "2021-06-02T16:52:15Z"
Format: "{year}-{month}-{day}T{hour}:{min}:{sec}[.{frac_sec}]Z"
{year} is always expressed using four digits, while {month}, {day}, {hour}, {min}, and {sec} are zero-padded to two digits each. The fractional seconds, which can go up to 9 digits (i.e. up to 1 nanosecond resolution), are optional. The "Z" suffix indicates the timezone ("UTC"); the timezone is required.
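The zero-padding rules of this format map directly onto Python format specifiers. The function below is a minimal sketch that omits the optional fractional seconds; `iso_8601_sketch` is a hypothetical name and not part of the library.

```python
def iso_8601_sketch(year: int, month: int, day: int,
                    hour: int = 0, minute: int = 0, second: int = 0) -> str:
    """Format date parts as '{year}-{month}-{day}T{hour}:{min}:{sec}Z'."""
    # Four digits for the year, two zero-padded digits for every other part.
    return (f"{year:04d}-{month:02d}-{day:02d}"
            f"T{hour:02d}:{minute:02d}:{second:02d}Z")


print(iso_8601_sketch(2021, 6, 2, 16, 52, 15))  # 2021-06-02T16:52:15Z
```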
- protobuf_timestamp() Timestamp [source]
Returns the date in a Google Protobuf Timestamp object
- Returns:
the date in a Google Protobuf Timestamp object
- formatted_string(fmt: str) str [source]
Returns the date in the specified format
- Parameters:
fmt – the format as a string to return the date in
- Returns:
the date in the specified format
- static from_datetime(dt: datetime) Date [source]
Create a Date object from a datetime object
- Parameters:
dt – the datetime object to create the Date object from
- Returns:
the Date object created from the datetime object
- static from_iso_8601(iso_8601: str) Date | None [source]
Create a Date object from an ISO 8601 formatted string
- Parameters:
iso_8601 – the ISO 8601 formatted string to create the Date object from
- Returns:
the Date object created from the ISO 8601 formatted string
- static parse_date(date_str: str, default_first: Literal['day', 'month'] = 'day', compliance: Literal['lenient', 'strict'] = 'lenient') Date [source]
Parse a date string into a Date object
There is a lot of variation in how dates are formatted, and this function attempts to handle as many of them as possible. The function will first attempt to parse the date string as an ISO 8601 formatted string. If that fails, it will attempt to parse the date string as a date string with separators.
In this process it is sometimes unknowable whether 01-02-2024 is January 2nd or February 1st, so the function will use the default_first parameter to determine this. If the default_first parameter is set to “day”, the function will assume that the day comes first, and if it is set to “month”, the function will assume that the month comes first. If the default_first
- Parameters:
date_str – the date string to parse
default_first – the default unit to use if it is unclear which unit comes first between day and month
compliance – the compliance level of the parser
- Returns:
the Date object created from the date string
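The day/month ambiguity and the role of default_first can be shown with a short sketch. `split_ambiguous_date` is a hypothetical helper that handles only strings of the form xx-xx-yyyy with `-`, `/`, or `.` as separators. Resolving the order automatically when one of the two numbers is greater than 12 is an assumption of this sketch, not documented behavior of parse_date.

```python
import re
from typing import Literal, Tuple


def split_ambiguous_date(
    date_str: str,
    default_first: Literal['day', 'month'] = 'day',
) -> Tuple[int, int, int]:
    """Return (year, month, day) for a 'xx-xx-yyyy' style string."""
    first, second, year = (int(part) for part in re.split(r'[-/.]', date_str))
    if first > 12 and second <= 12:
        # Only the first number can be a day.
        day, month = first, second
    elif second > 12 and first <= 12:
        # Only the second number can be a day.
        month, day = first, second
    elif default_first == 'day':
        # Ambiguous: fall back to the caller's preference.
        day, month = first, second
    else:
        month, day = first, second
    # Note: no range validation is done here (e.g. both numbers above 12).
    return year, month, day


print(split_ambiguous_date("01-02-2024", default_first='day'))    # (2024, 2, 1)
print(split_ambiguous_date("01-02-2024", default_first='month'))  # (2024, 1, 2)
print(split_ambiguous_date("25/12/2024", default_first='month'))  # (2024, 12, 25)
```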
- class phenopacket_mapper.data_standards.ValueSet(elements: Tuple[Coding | CodeableConcept | CodeSystem | str | bool | int | float | Date | type, ...] = <factory>, name: str = '', description: str = '', _resources: Tuple[CodeSystem, ...] = <factory>)[source]
Bases:
object
Defines a set of values that can be used in a DataField
A value set defines the viable values for a DataField. It can be a list of values, codings, codeable concepts, dates, etc. Also, it can just list types or CodeSystems that are allowed for a DataField.
Example use cases:
- True, False, or Unknown
- only allow strings
- allow any numerical value (i.e., int, float)
- allow any date
- allow any code from one or more CodeSystems
- allow only a specific set of codings
- etc.
By assigning a ValueSet to a DataField, we can define the possible values for that field. This has multiple benefits: it allows for validation of the data, it facilitates the computability of the data, and it allows for better interoperability between different systems.
- Variables:
elements – List of elements that define the value set
name – Name of the value set
description – Description of the value set
- elements: Tuple[Coding | CodeableConcept | CodeSystem | str | bool | int | float | Date | type, ...]
- property resources: Tuple[CodeSystem, ...]
Returns the resources if they exist, otherwise provides a default empty list.
- static parse_value_set(value_set_str: str, value_set_name: str = '', value_set_description: str = '', resources: List[CodeSystem] = None, compliance: Literal['strict', 'lenient'] = 'lenient') ValueSet [source]
Parses a value set from a string representation
- Parameters:
value_set_str – String representation of the value set
value_set_name – Name of the value set
value_set_description – Description of the value set
resources – List of CodeSystems to use for parsing the value set
compliance – Compliance level for parsing the value set
- Returns:
A ValueSet object as defined by the string representation
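The idea that a ValueSet may mix concrete values and allowed types can be illustrated with a plain membership check. This is a sketch under assumptions: `in_value_set` is a hypothetical helper working on a plain tuple of elements, it ignores codings and code systems, and it is not how the library validates values.

```python
from typing import Any, Tuple


def in_value_set(value: Any, elements: Tuple[Any, ...]) -> bool:
    """Hypothetical check: a value is allowed if it equals a listed element
    or is an instance of a listed type."""
    for element in elements:
        if isinstance(element, type):
            # bool is a subclass of int in Python; do not let True pass as an int.
            if isinstance(value, element) and not (
                isinstance(value, bool) and element is not bool
            ):
                return True
        elif type(element) is type(value) and element == value:
            # Compare types first so that 1 does not match True.
            return True
    return False


true_false_unknown = (True, False, "Unknown")
any_number = (int, float)

print(in_value_set("Unknown", true_false_unknown))  # True
print(in_value_set(3.5, any_number))                # True
print(in_value_set("3.5", any_number))              # False
```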