phenopacket_mapper.data_standards package

This submodule defines the data standards used in the project.

class phenopacket_mapper.data_standards.Cardinality(min: int = 0, max: int | Literal['n'] = 'n')[source]

Bases: object

min: int
max: int | Literal['n']
class property ZERO_TO_ONE: Cardinality[source]


class property ZERO_TO_N: Cardinality[source]


class property ONE: Cardinality[source]


class property ONE_TO_N: Cardinality[source]

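The preset names suggest the bounds each instance carries. The following is a minimal sketch, assuming ZERO_TO_ONE means 0..1, ZERO_TO_N means 0..n, ONE means exactly 1, and ONE_TO_N means 1..n. These bounds are inferred from the names and are not confirmed by this page. `CardinalitySketch` and `allows` are hypothetical stand-ins, not part of the package.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class CardinalitySketch:
    """Hypothetical stand-in for Cardinality; not the package's class."""
    min: int = 0
    max: Union[int, Literal['n']] = 'n'

    def allows(self, count: int) -> bool:
        """Return True if `count` occurrences fall within [min, max]."""
        if count < self.min:
            return False
        # 'n' means there is no upper bound.
        return self.max == 'n' or count <= self.max


# Assumed bounds, inferred from the preset names only.
ZERO_TO_ONE = CardinalitySketch(0, 1)
ZERO_TO_N = CardinalitySketch(0, 'n')
ONE = CardinalitySketch(1, 1)
ONE_TO_N = CardinalitySketch(1, 'n')
```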

class phenopacket_mapper.data_standards.Coding(system: str | CodeSystem, code: str, display: str = '', text: str = '')[source]

Bases: object

Data class for Coding

A Coding is a representation of a concept defined by a code and a code system. It is used in the CodeableConcept data class.

Variables:
  • system – The code system that defines the code

  • code – The code that represents the concept

  • display – The human readable representation of the concept

  • text – A human readable description or other additional text of the concept

system: str | CodeSystem
code: str
display: str
text: str
static parse_coding(coding_str: str, resources: List[CodeSystem], compliance: Literal['lenient', 'strict'] = 'lenient') Coding[source]

Parses a string representing a coding into a Coding object

Expected format: <namespace_prefix>:<code>

E.g.:

>>> Coding.parse_coding("SNOMED:404684003", [code_system.SNOMED_CT])
Coding(system=CodeSystem(name=SNOMED CT, name space prefix=SNOMED, version=0.0.0), code='404684003', display='', text='')

Intended to be called with a list of all resources used.

Only the namespace prefixes of the code systems provided in the resources list can be recognized. If a namespace prefix is not found in the resources, the method returns a Coding object whose system is the namespace prefix as a plain string and whose code is the code.

E.g.:

>>> Coding.parse_coding("SNOMED:404684003", [])
Warning: Code system with namespace prefix 'SNOMED' not found in resources.
Warning: Returning Coding object with system as namespace prefix and code as '404684003'
Coding(system='SNOMED', code='404684003', display='', text='')

Parameters:
  • coding_str – a string representing a coding

  • resources – a list of all resources used

  • compliance – whether to throw a ValueError or just a warning if a name space prefix is not found in the resources

Returns:

a Coding object as specified in the coding string
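The lookup described above can be sketched with stand-in classes. This is a minimal illustration, not the package's implementation: `SystemSketch`, `CodingSketch`, and `parse_coding_sketch` are hypothetical names, and it assumes that 'strict' is the compliance level that raises the ValueError while 'lenient' only warns.

```python
import warnings
from dataclasses import dataclass
from typing import List, Literal, Union


@dataclass
class SystemSketch:
    """Hypothetical stand-in for CodeSystem."""
    name: str
    namespace_prefix: str


@dataclass
class CodingSketch:
    """Hypothetical stand-in for Coding."""
    system: Union[str, SystemSketch]
    code: str


def parse_coding_sketch(
    coding_str: str,
    resources: List[SystemSketch],
    compliance: Literal['lenient', 'strict'] = 'lenient',
) -> CodingSketch:
    # Split "<namespace_prefix>:<code>" at the first colon only.
    prefix, _, code = coding_str.partition(':')
    for system in resources:
        if system.namespace_prefix == prefix:
            return CodingSketch(system=system, code=code)
    message = f"Code system with namespace prefix '{prefix}' not found in resources."
    if compliance == 'strict':
        # Assumption: 'strict' is the level that raises.
        raise ValueError(message)
    warnings.warn(message)
    # Fall back to the bare prefix string as the system.
    return CodingSketch(system=prefix, code=code)
```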

class phenopacket_mapper.data_standards.CodeableConcept(coding: List[Coding], text: str = '')[source]

Bases: object

Data class for CodeableConcept

A CodeableConcept represents a concept that is defined by a set of codes. The concept may additionally have a text representation.

Variables:
  • coding – A list of codings that define the concept

  • text – A text representation of the concept

coding: List[Coding]
text: str
class phenopacket_mapper.data_standards.DataModel(name: str, fields: Tuple[DataField | DataSection | OrGroup, ...], id: str = None, resources: Tuple[CodeSystem, ...] = <factory>)[source]

Bases: object

This class defines a data model for medical data using DataField

A data model can be used to import data and map it to the Phenopacket schema. It is made up of a list of DataField objects.

Given that all DataField objects in a DataModel have unique names, the id field is generated from the name. E.g.: DataField(name='Date of Birth', …) will have an id of 'date_of_birth'. The DataField objects can be accessed using the id as an attribute of the DataModel object. E.g.: data_model.date_of_birth. This is useful in the data reading and mapping processes.

Variables:
  • name – Name of the data model

  • fields – List of DataField objects

  • resources – List of CodeSystem objects
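The id-based attribute access described above can be pictured with `__getattr__`. This is only a sketch of the idea under the assumption that lookups are resolved dynamically; `ModelSketch` is a hypothetical class and says nothing about how DataModel is actually implemented.

```python
from typing import Dict


class ModelSketch:
    """Hypothetical stand-in showing id-based attribute access."""

    def __init__(self, fields_by_id: Dict[str, str]):
        self._fields = dict(fields_by_id)

    def __getattr__(self, item: str) -> str:
        # Called only when normal attribute lookup fails.
        fields = self.__dict__.get('_fields', {})
        if item in fields:
            return fields[item]
        raise AttributeError(item)


model = ModelSketch({'date_of_birth': 'Date of Birth'})
```

With this sketch, `model.date_of_birth` resolves to the stored field, and an unknown id raises AttributeError.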

name: str
fields: Tuple[DataField | DataSection | OrGroup, ...]
id: str
resources: Tuple[CodeSystem, ...]
property is_hierarchical: bool
get_field(field_id: str, default: Optional = None) DataField | None[source]

Returns a DataField object by its id

Parameters:
  • field_id – The id of the field

  • default – The default value to return if the field is not found

Returns:

The DataField object

get_field_ids() List[str][source]

Returns a list of the ids of the DataFields in the DataModel

load_data(path: str | Path, compliance: Literal['lenient', 'strict'] = 'lenient', **kwargs) DataSet[source]

Loads data from a file using a DataModel definition

To call this method, pass the column name for each field in the DataModel as a keyword argument. This is done by passing the field id followed by ‘_column’. E.g. if the DataModel has a field with id ‘date_of_birth’, the column name in the file should be passed as ‘date_of_birth_column’. The method will raise an error if any of the fields are missing.

E.g.:

```python
# DataField takes the value set via its `specification` parameter
data_model = DataModel("Test data model", [DataField(name="Field 1", specification=ValueSet())])
data_model.load_data("data.csv", field_1_column="column_name_in_file")
```

Parameters:
  • path – Path to the file containing the data

  • compliance – Compliance level to use when loading the data.

  • kwargs – Dynamically passed parameters that match {id}_column for each item

Returns:

A DataSet, i.e. a collection of DataModelInstance objects
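When a model has many fields, the `{id}_column` keyword arguments can be assembled programmatically. A minimal sketch, assuming the field ids and file column names are already known; `column_kwargs_sketch` is a hypothetical helper and not part of the package.

```python
from typing import Dict, List


def column_kwargs_sketch(field_ids: List[str], columns: Dict[str, str]) -> Dict[str, str]:
    """Build `{id}_column` keyword arguments (hypothetical helper)."""
    # Fail early if a field has no column, mirroring the documented error.
    missing = [fid for fid in field_ids if fid not in columns]
    if missing:
        raise ValueError(f"No column given for fields: {missing}")
    return {f"{fid}_column": columns[fid] for fid in field_ids}


kwargs = column_kwargs_sketch(
    ['date_of_birth', 'sex'],
    {'date_of_birth': 'DOB', 'sex': 'Sex at birth'},
)
# kwargs could then be unpacked into a call such as
# data_model.load_data("data.csv", **kwargs)
```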

class phenopacket_mapper.data_standards.DataField(name: str, specification: ValueSet | type | List[type], id: str = None, required: bool = False, description: str = '', cardinality: Cardinality = <factory>)[source]

Bases: DataNode

This class defines fields used in the definition of a DataModel

A data field is the equivalent of a column in a table. It has a name, a specification (a value set or one or more types), an id, a required flag, a description, and a cardinality.

The string for the id field is generated from the name field using the str_to_valid_id function from the phenopacket_mapper.utils module. This attempts to convert the name field. Sometimes this might not work as desired, in which case the id field can be set manually.

Naming rules for the id field:
  • The id field must be a valid Python identifier

  • The id field must start with a letter or the underscore character

  • The id field cannot start with a number

  • The id field can only contain lowercase alphanumeric characters and underscores (a-z, 0-9, and _ )

  • The id field cannot be any of the Python keywords (e.g. in, is, not, class, etc.)

  • The id field must be unique within a DataModel
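The naming rules can be approximated in a few lines. This is a hedged sketch and not the actual str_to_valid_id function: `to_valid_id_sketch` is a hypothetical name, its handling of leading digits and keywords is an assumption, and it does not check uniqueness within a DataModel.

```python
import keyword
import re


def to_valid_id_sketch(name: str) -> str:
    """Hypothetical approximation of the id rules; not str_to_valid_id itself."""
    # Lowercase, then collapse every run of disallowed characters to "_".
    candidate = re.sub(r'[^a-z0-9_]+', '_', name.lower()).strip('_')
    # Identifiers may not be empty or start with a digit.
    if not candidate or candidate[0].isdigit():
        candidate = '_' + candidate
    # Avoid Python keywords such as "in" or "class".
    if keyword.iskeyword(candidate):
        candidate += '_'
    return candidate
```

When a generated id is not what you want, setting the id field manually remains the documented escape hatch.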

If the value set consists of a single type, that type can be passed directly as the specification parameter.

Variables:
  • name – Name of the field

  • specification – Value set of the field, if the value set is only one type, can also pass that type directly

  • id – The identifier of the field, adhering to the naming rules stated above

  • description – Description of the field

  • required – Required flag of the field

name: str
specification: ValueSet | type | List[type]
id: str
required: bool
description: str
cardinality: Cardinality
class phenopacket_mapper.data_standards.DataModelInstance(id: int | str, data_model: DataModel, values: Tuple[DataFieldValue | DataSectionInstance, ...], compliance: Literal['lenient', 'strict'] = 'lenient')[source]

Bases: object

This class defines an instance of a DataModel, i.e. a record in a dataset

This class is used to define an instance of a DataModel, i.e. a record or row in a dataset.

Variables:
  • id – The id of the instance, i.e. the row number

  • data_model – The DataModel object that defines the data model for this instance

  • values – A list of DataFieldValue objects, each adhering to the DataField definition in the DataModel

  • compliance – Compliance level to enforce when validating the instance. If ‘lenient’, the instance can have extra fields that are not in the DataModel. If ‘strict’, the instance must have all fields in the DataModel.

id: int | str
data_model: DataModel
values: Tuple[DataFieldValue | DataSectionInstance, ...]
compliance: Literal['lenient', 'strict']
validate() bool[source]

Validates the data model instance based on data model definition

This method checks if the instance is valid based on the data model definition. It checks if all required fields are present, if the values are in the value set, etc.

Returns:

True if the instance is valid, False otherwise

class phenopacket_mapper.data_standards.DataFieldValue(id: str | int, field: DataField, value: int | float | str | bool | Date | CodeSystem)[source]

Bases: object

This class defines the value of a DataField in a DataModelInstance

Equivalent to a cell value in a table.

Variables:
  • id – The id of the value, i.e. the row number

  • field – DataField: The DataField to which this value belongs and which defines the value set for the field.

  • value – The value of the field.

id: str | int
field: DataField
value: int | float | str | bool | Date | CodeSystem
validate() bool[source]

Validates the value against its DataField definition

This method checks whether the value is valid for the DataField it belongs to, e.g. whether it is in the field's value set.

Returns:

True if the value is valid, False otherwise

class phenopacket_mapper.data_standards.DataSet(data_model: DataModel, data: List[DataModelInstance])[source]

Bases: object

This class defines a dataset as defined by a DataModel

This class is used to define a dataset as defined by a DataModel. It is a collection of DataModelInstance objects.

Variables:
  • data_model – The DataModel object that defines the data model for this dataset

  • data – A list of DataModelInstance objects, each adhering to the DataField definition in the DataModel

data_model: DataModel
data: List[DataModelInstance]
property height
property width
property data_frame: DataFrame
preprocess(fields: str | DataField | List[str | DataField], mapping: Dict | Callable, **kwargs)[source]

Preprocesses a field in the dataset

Preprocessing happens in place, i.e. the values in the dataset are modified directly.

If fields is a list of fields, the mapping must be a method that can handle a list of values being passed to it as value. E.g.:

```python
def preprocess_method(values, method, **kwargs):
    field1, field2 = values
    # do something with values
    return "preprocessed_values" + kwargs["arg1"] + kwargs["arg2"]

dataset.preprocess(["field_1", "field_2"], preprocess_method, arg1="value1", arg2="value2")
```

Parameters:
  • fields – Data fields to be preprocessed, will be passed onto mapping

  • mapping – A dictionary or method to use for preprocessing
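The difference between a dictionary mapping and a callable mapping can be illustrated on plain rows. A minimal sketch, assuming rows are dictionaries and that values missing from a dictionary mapping are left unchanged; `preprocess_sketch` is a hypothetical helper, not the DataSet method.

```python
from typing import Any, Callable, Dict, List, Union


def preprocess_sketch(
    rows: List[Dict[str, Any]],
    field: str,
    mapping: Union[Dict, Callable],
    **kwargs,
) -> None:
    """Apply `mapping` to one field of every row, in place (hypothetical helper)."""
    for row in rows:
        value = row[field]
        if callable(mapping):
            # Extra keyword arguments are forwarded to the callable.
            row[field] = mapping(value, **kwargs)
        else:
            # Assumption: values missing from the dictionary stay unchanged.
            row[field] = mapping.get(value, value)


rows = [{'sex': 'm'}, {'sex': 'f'}, {'sex': 'unknown'}]
preprocess_sketch(rows, 'sex', {'m': 'male', 'f': 'female'})
```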

head(n: int = 5)[source]
class phenopacket_mapper.data_standards.DataSection(name: str, id: str = None, fields: Tuple[DataField | DataSection | OrGroup, ...] = <factory>, required: bool = False, cardinality: Cardinality = <factory>)[source]

Bases: object

This class defines a section in a DataModel

A section is a collection of DataField or DataSection objects. It is used to group related fields in a DataModel.

Variables:
  • name – Name of the section

  • fields – List of DataField objects

name: str
id: str
fields: Tuple[DataField | DataSection | OrGroup, ...]
required: bool
cardinality: Cardinality
class phenopacket_mapper.data_standards.OrGroup(fields: Tuple[DataField | DataSection | OrGroup, ...], name: str = 'Or Group', id: str = None, description: str = '', required: bool = False, cardinality: Cardinality = Cardinality(min=0, max='n'))[source]

Bases: DataNode

fields: Tuple[DataField | DataSection | OrGroup, ...]
name: str
id: str
description: str
required: bool
cardinality: Cardinality
class phenopacket_mapper.data_standards.CodeSystem(name: str, namespace_prefix: str, url: str = None, iri_prefix: str = None, version: str = '0.0.0', synonyms: List[str] = <factory>)[source]

Bases: object

Data class for a CodeSystem

A CodeSystem is a resource that defines a set of codes and their meanings. It could be a terminology, an ontology, a nomenclature, etc. Popular examples include SNOMED CT, HPO, MONDO, OMIM, ORDO, LOINC, etc.

This class is necessary to fill the resources parameter in the Phenopacket later.

Variables:
  • name – The name of the CodeSystem

  • namespace_prefix – The namespace prefix of the CodeSystem

  • url – The URL of the CodeSystem

  • iri_prefix – The IRI prefix of the CodeSystem

  • version – The version of the CodeSystem

name: str
namespace_prefix: str
url: str
iri_prefix: str
version: str

List typical alternative abbreviations or names for the resource, to better parse its usage (e.g. ‘HPO’ for the Human Phenotype Ontology, even if its name space prefix is commonly ‘HP’)

synonyms: List[str]
set_version(value) CodeSystem[source]
class phenopacket_mapper.data_standards.Date(year: int = 0, month: int = 0, day: int = 0, hour: int = 0, minute: int = 0, second: int = 0)[source]

Bases: object

Data class for Date

This class defines a date object with many useful utility functions, especially for conversions from and to specific string formats.

Variables:
  • year – the year of the date

  • month – the month of the date

  • day – the day of the date

  • hour – the hour of the date

  • minute – the minute of the date

  • second – the second of the date

year: int
month: int
day: int
hour: int
minute: int
second: int
year_str: str
month_str: str
day_str: str
hour_str: str
minute_str: str
second_str: str
iso_8601_datestring(allow_zeros: bool = True) str[source]

Returns the date in ISO 8601 format

Example: "2021-06-02T16:52:15Z"

Format: "{year}-{month}-{day}T{hour}:{min}:{sec}[.{frac_sec}]Z"

Definition: {year} is always expressed using four digits, while {month}, {day}, {hour}, {min}, and {sec} are zero-padded to two digits each. The fractional seconds, which can go up to 9 digits (i.e. up to 1 nanosecond resolution), are optional. The "Z" suffix indicates the timezone ("UTC"); the timezone is required.
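The zero-padding rules in the format definition map directly onto Python format specifiers. A minimal sketch that omits the optional fractional seconds; `iso_8601_sketch` is a hypothetical function, not the method documented here.

```python
def iso_8601_sketch(year: int, month: int, day: int,
                    hour: int = 0, minute: int = 0, second: int = 0) -> str:
    """Format date parts as "{year}-{month}-{day}T{hour}:{min}:{sec}Z"."""
    # Four digits for the year, two zero-padded digits for everything else.
    return (f"{year:04d}-{month:02d}-{day:02d}"
            f"T{hour:02d}:{minute:02d}:{second:02d}Z")
```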

protobuf_timestamp() Timestamp[source]

Returns the date in a Google Protobuf Timestamp object

Returns:

the date in a Google Protobuf Timestamp object

formatted_string(fmt: str) str[source]

Returns the date in the specified format

Parameters:

fmt – the format as a string to return the date in

Returns:

the date in the specified format

static from_datetime(dt: datetime) Date[source]

Create a Date object from a datetime object

Parameters:

dt – the datetime object to create the Date object from

Returns:

the Date object created from the datetime object

static from_iso_8601(iso_8601: str) Date | None[source]

Create a Date object from an ISO 8601 formatted string

Parameters:

iso_8601 – the ISO 8601 formatted string to create the Date object from

Returns:

the Date object created from the ISO 8601 formatted string

static parse_date(date_str: str, default_first: Literal['day', 'month'] = 'day', compliance: Literal['lenient', 'strict'] = 'lenient') Date[source]

Parse a date string into a Date object

There is a lot of variation in how dates are formatted, and this function attempts to handle as many of them as possible. The function will first attempt to parse the date string as an ISO 8601 formatted string. If that fails, it will attempt to parse the date string as a date string with separators.

In this process it is sometimes ambiguous whether 01-02-2024 is January 2nd or February 1st, so the function uses the default_first parameter to decide. If default_first is set to "day", the function assumes that the day comes first; if it is set to "month", the function assumes that the month comes first.

Parameters:
  • date_str – the date string to parse

  • default_first – the default unit to use if it is unclear which unit comes first between day and month

  • compliance – the compliance level of the parser

Returns:

the Date object created from the date string
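The day/month ambiguity can be made concrete with a small resolver. This is a sketch under stated assumptions: it handles only three numeric parts with the year last, and it treats a part greater than 12 as unambiguously the day, which this page does not confirm the real parser does. `split_day_month_sketch` is a hypothetical name.

```python
import re
from typing import Literal, Tuple


def split_day_month_sketch(
    date_str: str,
    default_first: Literal['day', 'month'] = 'day',
) -> Tuple[int, int, int]:
    """Return (year, month, day) for strings like "01-02-2024" (hypothetical helper)."""
    first, second, year = (int(part) for part in re.split(r'[-/.]', date_str))
    if first > 12:
        # Only a day can exceed 12, so the order is unambiguous.
        return year, second, first
    if second > 12:
        return year, first, second
    # Ambiguous: fall back on default_first.
    if default_first == 'day':
        return year, second, first
    return year, first, second
```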

class phenopacket_mapper.data_standards.ValueSet(elements: Tuple[Coding | CodeableConcept | CodeSystem | str | bool | int | float | Date | type, ...] = <factory>, name: str = '', description: str = '', _resources: Tuple[CodeSystem, ...] = <factory>)[source]

Bases: object

Defines a set of values that can be used in a DataField

A value set defines the viable values for a DataField. It can be a list of values, codings, codeable concepts, dates, etc. Also, it can just list types or CodeSystems that are allowed for a DataField.

Example use cases:
  • True, False, or Unknown

  • only allow strings

  • allow any numerical value (i.e., int, float)

  • allow any date

  • allow any code from one or more CodeSystems

  • allow only a specific set of codings

  • etc.

By assigning a ValueSet to a DataField, we can define the possible values for that field. This has multiple benefits: it allows for validation of the data, it facilitates the computability of the data, and it allows for better interoperability between different systems.
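How a value set mixing literal values and types could validate data can be sketched as a membership check. This is an illustration only: `is_allowed_sketch` is a hypothetical helper, it ignores codings and code systems, and the same-type comparison for literals is an assumption made so that 1 does not match True.

```python
from typing import Any, Tuple


def is_allowed_sketch(value: Any, elements: Tuple[Any, ...]) -> bool:
    """Check a value against literal elements and allowed types (hypothetical helper)."""
    for element in elements:
        if isinstance(element, type):
            # A type element admits any value of that type.
            if isinstance(value, element):
                return True
        elif type(element) is type(value) and element == value:
            # A literal element admits only an equal value of the same type.
            return True
    return False
```

For example, the "True, False, or Unknown" use case corresponds to the elements `(True, False, "Unknown")`, and "any numerical value" to `(int, float)`.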

Variables:
  • elements – List of elements that define the value set

  • name – Name of the value set

  • description – Description of the value set

elements: Tuple[Coding | CodeableConcept | CodeSystem | str | bool | int | float | Date | type, ...]
name: str
description: str
extend(new_name: str, value_set: ValueSet, new_description: str = '') ValueSet[source]
remove_duplicates() ValueSet[source]
property resources: Tuple[CodeSystem, ...]

Returns the resources if they exist, otherwise provides a default empty list.

static parse_value_set(value_set_str: str, value_set_name: str = '', value_set_description: str = '', resources: List[CodeSystem] = None, compliance: Literal['strict', 'lenient'] = 'lenient') ValueSet[source]

Parses a value set from a string representation

Parameters:
  • value_set_str – String representation of the value set

  • value_set_name – Name of the value set

  • value_set_description – Description of the value set

  • resources – List of CodeSystems to use for parsing the value set

  • compliance – Compliance level for parsing the value set

Returns:

A ValueSet object as defined by the string representation

Subpackages

Submodules