phenopacket_mapper.data_standards package

This submodule defines the data standards used in the project.

class phenopacket_mapper.data_standards.Cardinality(min: int = 0, max: int | Literal['n'] = 'n')[source]

Bases: object

min: int

max: int | Literal['n']

class property ZERO_TO_ONE: Cardinality[source]

Cardinality(min: int = 0, max: Union[int, Literal['n']] = 'n')

class property ZERO_TO_N: Cardinality[source]

Cardinality(min: int = 0, max: Union[int, Literal['n']] = 'n')

class property ONE: Cardinality[source]

Cardinality(min: int = 0, max: Union[int, Literal['n']] = 'n')

class property ONE_TO_N: Cardinality[source]

Cardinality(min: int = 0, max: Union[int, Literal['n']] = 'n')
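
The following is a minimal, self-contained sketch of how a cardinality with an open upper bound of 'n' can be modelled. CardinalitySketch and its allows helper are hypothetical names, and the bounds assigned to the four constants are assumptions inferred only from the property names above; this is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class CardinalitySketch:
    """Hypothetical stand-in for Cardinality, for illustration only."""
    min: int = 0
    max: Union[int, Literal["n"]] = "n"

    def allows(self, count: int) -> bool:
        """Hypothetical helper: is `count` occurrences within [min, max]?"""
        if count < self.min:
            return False
        # 'n' means "no upper bound"; otherwise compare against the integer max.
        return self.max == "n" or count <= self.max


# Assumed bounds, inferred from the names ZERO_TO_ONE, ZERO_TO_N, ONE, ONE_TO_N.
ZERO_TO_ONE = CardinalitySketch(0, 1)
ZERO_TO_N = CardinalitySketch(0, "n")
ONE = CardinalitySketch(1, 1)
ONE_TO_N = CardinalitySketch(1, "n")

print(ZERO_TO_ONE.allows(2))
print(ONE_TO_N.allows(5))
```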

class phenopacket_mapper.data_standards.Coding(system: str | CodeSystem, code: str, display: str = '', text: str = '')[source]

Bases: object

Data class for Coding

A Coding is a representation of a concept defined by a code and a code system. It is used in the CodeableConcept data class.

Variables:

system – The code system that defines the code
code – The code that represents the concept
display – The human readable representation of the concept
text – A human readable description or other additional text of the concept

system: str | CodeSystem

code: str

display: str

text: str

static parse_coding(coding_str: str, resources: List[CodeSystem], compliance: Literal['lenient', 'strict'] = 'lenient') → Coding[source]

Parses a string representing a coding into a Coding object

Expected format: <namespace_prefix>:<code>

E.g.:

>>> Coding.parse_coding("SNOMED:404684003", [code_system.SNOMED_CT])
Coding(system=CodeSystem(name=SNOMED CT, name space prefix=SNOMED, version=0.0.0), code='404684003', display='', text='')

Intended to be called with a list of all resources used.

Can only recognize the name space prefixes that belong to code systems provided in the resources list. If a name space is not found in the resources, it will return a Coding object with the system as the name space prefix and the code as the code.

E.g.:

>>> Coding.parse_coding("SNOMED:404684003", [])
Warning: Code system with namespace prefix 'SNOMED' not found in resources.
Warning: Returning Coding object with system as namespace prefix and code as '404684003'
Coding(system='SNOMED', code='404684003', display='', text='')

Parameters:

coding_str – a string representing a coding
resources – a list of all resources used
compliance – whether to throw a ValueError or just a warning if a name space prefix is not found in the resources

Returns:

a Coding object as specified in the coding string
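
The prefix lookup described above can be sketched without the library as follows. CodeSystemSketch, CodingSketch and parse_coding_sketch are hypothetical stand-ins, and the mapping of 'strict' to a ValueError and 'lenient' to a warning is an assumption based on the compliance parameter description; the real method may behave differently in detail.

```python
import warnings
from dataclasses import dataclass
from typing import List, Literal, Union


@dataclass(frozen=True)
class CodeSystemSketch:
    """Hypothetical, reduced stand-in for CodeSystem."""
    name: str
    namespace_prefix: str


@dataclass(frozen=True)
class CodingSketch:
    """Hypothetical, reduced stand-in for Coding."""
    system: Union[str, CodeSystemSketch]
    code: str
    display: str = ""
    text: str = ""


def parse_coding_sketch(
    coding_str: str,
    resources: List[CodeSystemSketch],
    compliance: Literal["lenient", "strict"] = "lenient",
) -> CodingSketch:
    # Split "<namespace_prefix>:<code>" at the first colon only.
    prefix, sep, code = coding_str.partition(":")
    if not sep:
        raise ValueError(f"Expected '<namespace_prefix>:<code>', got {coding_str!r}")
    for code_system in resources:
        if code_system.namespace_prefix == prefix:
            return CodingSketch(system=code_system, code=code)
    message = f"Code system with namespace prefix {prefix!r} not found in resources."
    if compliance == "strict":  # assumption: strict raises, lenient only warns
        raise ValueError(message)
    warnings.warn(message)
    # Fall back to the bare prefix string as the system.
    return CodingSketch(system=prefix, code=code)


snomed = CodeSystemSketch(name="SNOMED CT", namespace_prefix="SNOMED")
resolved = parse_coding_sketch("SNOMED:404684003", [snomed])
print(resolved.code)
```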

class phenopacket_mapper.data_standards.CodeableConcept(coding: List[Coding], text: str = '')[source]

Bases: object

Data class for CodeableConcept

A CodeableConcept represents a concept that is defined by a set of codes. The concept may additionally have a text representation.

Variables:

coding – A list of codings that define the concept
text – A text representation of the concept

coding: List[Coding]

text: str

class phenopacket_mapper.data_standards.DataModel(name: str, fields: Tuple[DataField | DataSection | OrGroup, ...], id: str = None, resources: Tuple[CodeSystem, ...] = <factory>)[source]

Bases: object

This class defines a data model for medical data using DataField objects.

A data model can be used to import data and map it to the Phenopacket schema. It is made up of a list of DataField objects.

Given that all DataField objects in a DataModel have unique names, the id field is generated from the name. E.g.: DataField(name=’Date of Birth’, …) will have an id of ‘date_of_birth’. The DataField objects can be accessed using the id as an attribute of the DataModel object. E.g.: data_model.date_of_birth. This is useful in the data reading and mapping processes.
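
One way such id-based attribute access can work is a __getattr__ fallback that searches the fields. The sketch below uses hypothetical FieldSketch and ModelSketch classes and only illustrates the idea; whether DataModel is implemented this way is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FieldSketch:
    """Hypothetical, reduced stand-in for DataField."""
    name: str
    id: str


@dataclass(frozen=True)
class ModelSketch:
    """Hypothetical, reduced stand-in for DataModel."""
    name: str
    fields: Tuple[FieldSketch, ...]

    def __getattr__(self, item: str) -> FieldSketch:
        # Only called when normal attribute lookup fails, so `name` and
        # `fields` themselves are never routed through here.
        for field in self.__dict__.get("fields", ()):
            if field.id == item:
                return field
        raise AttributeError(f"No field with id {item!r}")


model = ModelSketch("Demo", (FieldSketch("Date of Birth", "date_of_birth"),))
print(model.date_of_birth.name)
```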

Variables:

name – Name of the data model
fields – List of DataField objects
resources – List of CodeSystem objects

name: str

fields: Tuple[DataField | DataSection | OrGroup, ...]

id: str

resources: Tuple[CodeSystem, ...]

property is_hierarchical: bool

get_field(field_id: str, default: Optional = None) → DataField | None[source]

Returns a DataField object by its id

Parameters:

field_id – The id of the field
default – The default value to return if the field is not found

Returns:

The DataField object

get_field_ids() → List[str][source]

Returns a list of the ids of the DataFields in the DataModel

load_data(path: str | Path, compliance: Literal['lenient', 'strict'] = 'lenient', **kwargs) → DataSet[source]

Loads data from a file using a DataModel definition

To call this method, pass the column name for each field in the DataModel as a keyword argument. This is done by passing the field id followed by ‘_column’. E.g. if the DataModel has a field with id ‘date_of_birth’, the column name in the file should be passed as ‘date_of_birth_column’. The method will raise an error if any of the fields are missing.

E.g.:

    data_model = DataModel("Test data model", [DataField(name="Field 1", specification=ValueSet())])
    data_model.load_data("data.csv", field_1_column="column_name_in_file")

Parameters:

path – Path to the file containing the data
compliance – Compliance level to use when loading the data.
kwargs – Dynamically passed parameters that match {id}_column for each item

Returns:

A DataSet of DataModelInstance objects
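
The '{id}_column' keyword convention can be illustrated with a small stand-alone helper. resolve_columns is a hypothetical function, and raising TypeError for missing arguments is an assumption; the documentation above only states that an error is raised.

```python
from typing import Dict, List


def resolve_columns(field_ids: List[str], **kwargs: str) -> Dict[str, str]:
    """Map each field id to the column name passed as '<id>_column'."""
    columns = {}
    missing = []
    for field_id in field_ids:
        key = f"{field_id}_column"
        if key in kwargs:
            columns[field_id] = kwargs[key]
        else:
            missing.append(key)
    if missing:
        # Assumption: the error type is illustrative only.
        raise TypeError(f"Missing column arguments: {', '.join(missing)}")
    return columns


print(resolve_columns(["date_of_birth"], date_of_birth_column="DOB"))
```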

class phenopacket_mapper.data_standards.DataField(name: str, specification: ValueSet | type | List[type], id: str = None, required: bool = False, description: str = '', cardinality: Cardinality = <factory>)[source]

Bases: DataNode

This class defines fields used in the definition of a DataModel

A data field is the equivalent of a column in a table. It has a name, a specification (a value set or one or more types), an id, a required flag, a description, and a cardinality.

The string for the id field is generated from the name field using the str_to_valid_id function from the phenopacket_mapper.utils module. This attempts to convert the name field. Sometimes this might not work as desired, in which case the id field can be set manually.

Naming rules for the id field:

- The id field must be a valid Python identifier
- The id field must start with a letter or the underscore character
- The id field cannot start with a number
- The id field can only contain lowercase alphanumeric characters and underscores (a-z, 0-9, and _)
- The id field cannot be any of the Python keywords (e.g. in, is, not, class, etc.)
- The id field must be unique within a DataModel

If the specification is a single type, it can be passed directly as the specification parameter.
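
The naming rules above can be checked with the standard library alone. In the sketch below, name_to_id_sketch and is_valid_id are hypothetical helpers; the actual str_to_valid_id function in phenopacket_mapper.utils may convert names differently, and the uniqueness rule is not covered because it depends on the surrounding DataModel.

```python
import keyword
import re


def name_to_id_sketch(name: str) -> str:
    """Hypothetical conversion; the library's str_to_valid_id may differ."""
    # Lowercase, then collapse every run of disallowed characters to '_'.
    candidate = re.sub(r"[^a-z0-9_]+", "_", name.strip().lower()).strip("_")
    if not candidate or candidate[0].isdigit():
        candidate = "_" + candidate
    if keyword.iskeyword(candidate):
        candidate += "_"
    return candidate


def is_valid_id(candidate: str) -> bool:
    """Check the naming rules above, except uniqueness within a DataModel."""
    return (
        re.fullmatch(r"[a-z_][a-z0-9_]*", candidate) is not None
        and not keyword.iskeyword(candidate)
    )


print(name_to_id_sketch("Date of Birth"))
```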

Variables:

name – Name of the field
specification – Value set of the field, if the value set is only one type, can also pass that type directly
id – The identifier of the field, adhering to the naming rules stated above
description – Description of the field
required – Required flag of the field

name: str

specification: ValueSet | type | List[type]

id: str

required: bool

description: str

cardinality: Cardinality

class phenopacket_mapper.data_standards.DataModelInstance(id: int | str, data_model: DataModel, values: Tuple[DataFieldValue | DataSectionInstance, ...], compliance: Literal['lenient', 'strict'] = 'lenient')[source]

Bases: object

This class defines an instance of a DataModel, i.e. a record in a dataset

This class is used to define an instance of a DataModel, i.e. a record or row in a dataset.

Variables:

id – The id of the instance, i.e. the row number
data_model – The DataModel object that defines the data model for this instance
values – A list of DataFieldValue objects, each adhering to the DataField definition in the DataModel
compliance – Compliance level to enforce when validating the instance. If ‘lenient’, the instance can have extra fields that are not in the DataModel. If ‘strict’, the instance must have all fields in the DataModel.

id: int | str

data_model: DataModel

values: Tuple[DataFieldValue | DataSectionInstance, ...]

compliance: Literal['lenient', 'strict']

validate() → bool[source]

Validates the data model instance based on data model definition

This method checks if the instance is valid based on the data model definition. It checks if all required fields are present, if the values are in the value set, etc.

Returns:

True if the instance is valid, False otherwise
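
As a rough illustration of the checks described above (required fields present, values acceptable), the following sketch validates a plain dictionary record. FieldRule and validate_record are hypothetical names, and treating a missing or None value as absent is an assumption; the library's own validation logic is not shown here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical, reduced stand-in for a DataField definition."""
    id: str
    accepts: Callable[[Any], bool]  # predicate standing in for a value set
    required: bool = False


def validate_record(rules: Tuple[FieldRule, ...], record: Dict[str, Any]) -> bool:
    for rule in rules:
        value = record.get(rule.id)
        if value is None:
            # Assumption: a missing value only fails validation if required.
            if rule.required:
                return False
            continue
        if not rule.accepts(value):
            return False
    return True


rules = (
    FieldRule("sex", lambda v: v in {"male", "female"}, required=True),
    FieldRule("age", lambda v: isinstance(v, int) and v >= 0),
)
print(validate_record(rules, {"sex": "female", "age": 42}))
print(validate_record(rules, {"age": 42}))
```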

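class phenopacket_mapper.data_standards.DataFieldValue(id: str | int, field: DataField, value: int | float | str | bool | Date | CodeSystem)[source]

Bases: object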

This class defines the value of a DataField in a DataModelInstance

Equivalent to a cell value in a table.

Variables:

id – The id of the value, i.e. the row number
field – DataField: The DataField to which this value belongs and which defines the value set for the field.
value – The value of the field.

id: str | int

field: DataField

value: int | float | str | bool | Date | CodeSystem

validate() → bool[source]

Validates the data model instance based on data model definition

This method checks if the instance is valid based on the data model definition. It checks if all required fields are present, if the values are in the value set, etc.

Returns:

True if the instance is valid, False otherwise

class phenopacket_mapper.data_standards.DataSet(data_model: DataModel, data: List[DataModelInstance])[source]

Bases: object

This class defines a dataset as defined by a DataModel

This class is used to define a dataset as defined by a DataModel. It is a collection of DataModelInstance objects.

Variables:

data_model – The DataModel object that defines the data model for this dataset
data – A list of DataModelInstance objects, each adhering to the DataField definition in the DataModel

data_model: DataModel

data: List[DataModelInstance]

property height

property width

property data_frame: DataFrame

preprocess(fields: str | DataField | List[str | DataField], mapping: Dict | Callable, **kwargs)[source]

Preprocesses a field in the dataset

Preprocessing happens in place, i.e. the values in the dataset are modified directly.

If fields is a list of fields, the mapping must be a method that can handle a list of values being passed as value to it. E.g.:

```python
def preprocess_method(values, method, **kwargs):
    field1, field2 = values
    # do something with values
    return "preprocessed_values" + kwargs["arg1"] + kwargs["arg2"]

dataset.preprocess(["field_1", "field_2"], preprocess_method, arg1="value1", arg2="value2")
```

Parameters:

fields – Data fields to be preprocessed, will be passed onto mapping
mapping – A dictionary or method to use for preprocessing
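
The difference between a dictionary mapping and a callable mapping, applied in place, can be sketched on a plain list. apply_mapping is a hypothetical helper; leaving values that are missing from the dictionary unchanged, and calling the function with only the value and keyword arguments, are assumptions rather than documented behaviour of preprocess.

```python
from typing import Any, Callable, Dict, List, Union


def apply_mapping(
    values: List[Any],
    mapping: Union[Dict[Any, Any], Callable[..., Any]],
    **kwargs: Any,
) -> None:
    """Rewrite `values` in place using a lookup table or a function."""
    for index, value in enumerate(values):
        if callable(mapping):
            values[index] = mapping(value, **kwargs)
        else:
            # Assumption: values missing from the dictionary stay unchanged.
            values[index] = mapping.get(value, value)


sex_column = ["m", "f", "unknown"]
apply_mapping(sex_column, {"m": "male", "f": "female"})
print(sex_column)
```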

head(n: int = 5)[source]

class phenopacket_mapper.data_standards.DataSection(name: str, id: str = None, fields: Tuple[DataField | DataSection | OrGroup, ...] = <factory>, required: bool = False, cardinality: Cardinality = <factory>)[source]

Bases: object

This class defines a section in a DataModel

A section is a collection of DataField or DataSection objects. It is used to group related fields in a DataModel.

Variables:

name – Name of the section
fields – List of DataField objects

name: str

id: str

fields: Tuple[DataField | DataSection | OrGroup, ...]

required: bool

cardinality: Cardinality

class phenopacket_mapper.data_standards.OrGroup(fields: Tuple[DataField | DataSection | OrGroup, ...], name: str = 'Or Group', id: str = None, description: str = '', required: bool = False, cardinality: Cardinality = Cardinality(min=0, max='n'))[source]

Bases: DataNode

fields: Tuple[DataField | DataSection | OrGroup, ...]

name: str

id: str

description: str

required: bool

cardinality: Cardinality

class phenopacket_mapper.data_standards.CodeSystem(name: str, namespace_prefix: str, url: str = None, iri_prefix: str = None, version: str = '0.0.0', synonyms: List[str] = <factory>)[source]

Bases: object

Data class for a CodeSystem

A CodeSystem is a resource that defines a set of codes and their meanings. It could be a terminology, an ontology, a nomenclature, etc. Popular examples include SNOMED CT, HPO, MONDO, OMIM, ORDO, LOINC, etc.

This class is necessary to fill the resources parameter in the Phenopacket later.

Variables:

name – The name of the CodeSystem
namespace_prefix – The namespace prefix of the CodeSystem
url – The URL of the CodeSystem
iri_prefix – The IRI prefix of the CodeSystem
version – The version of the CodeSystem

name: str

namespace_prefix: str

url: str

iri_prefix: str

version: str: List typical alternative abbreviations or names for the resource, to better parse its usage (e.g. ‘HPO’ for the Human Phenotype Ontology, even if its name space prefix is commonly ‘HP’)

synonyms: List[str]

set_version(value) → CodeSystem[source]

class phenopacket_mapper.data_standards.Date(year: int = 0, month: int = 0, day: int = 0, hour: int = 0, minute: int = 0, second: int = 0)[source]

Bases: object

Data class for Date

This class defines a date object with many useful utility functions, especially for conversions from and to specific string formats.

Variables:

year – the year of the date
month – the month of the date
day – the day of the date
hour – the hour of the date
minute – the minute of the date
second – the second of the date

year: int

month: int

day: int

hour: int

minute: int

second: int

year_str: str

month_str: str

day_str: str

hour_str: str

minute_str: str

second_str: str

iso_8601_datestring(allow_zeros: bool = True) → str[source]

Returns the date in ISO 8601 format

Example: "2021-06-02T16:52:15Z"

Format: "{year}-{month}-{day}T{hour}:{min}:{sec}[.{frac_sec}]Z"

Definition: {year} is always expressed using four digits, while {month}, {day}, {hour}, {min}, and {sec} are zero-padded to two digits each. The fractional seconds, which can go up to 9 digits (i.e. up to 1 nanosecond resolution), are optional. The "Z" suffix indicates the timezone ("UTC"); the timezone is required.
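
The zero-padding rules of this format can be reproduced with Python format specifiers. iso_8601_sketch is a hypothetical free function that omits the optional fractional seconds; it illustrates the format only and is not the method's implementation.

```python
def iso_8601_sketch(year: int, month: int, day: int,
                    hour: int = 0, minute: int = 0, second: int = 0) -> str:
    """Format date parts as '{year}-{month}-{day}T{hour}:{min}:{sec}Z'."""
    return (
        f"{year:04d}-{month:02d}-{day:02d}"      # four-digit year, padded month/day
        f"T{hour:02d}:{minute:02d}:{second:02d}Z"  # padded time, UTC suffix
    )


print(iso_8601_sketch(2021, 6, 2, 16, 52, 15))  # 2021-06-02T16:52:15Z
```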

protobuf_timestamp() → Timestamp[source]

Returns the date in a Google Protobuf Timestamp object

Returns:

the date in a Google Protobuf Timestamp object

formatted_string(fmt: str) → str[source]

Returns the date in the specified format

Parameters:

fmt – the format as a string to return the date in

Returns:

the date in the specified format

static from_datetime(dt: datetime) → Date[source]

Create a Date object from a datetime object

Parameters:

dt – the datetime object to create the Date object from

Returns:

the Date object created from the datetime object

static from_iso_8601(iso_8601: str) → Date | None[source]

Create a Date object from an ISO 8601 formatted string

Parameters:

iso_8601 – the ISO 8601 formatted string to create the Date object from

Returns:

the Date object created from the ISO 8601 formatted string

static parse_date(date_str: str, default_first: Literal['day', 'month'] = 'day', compliance: Literal['lenient', 'strict'] = 'lenient') → Date[source]

Parse a date string into a Date object

There is a lot of variation in how dates are formatted, and this function attempts to handle as many of them as possible. The function will first attempt to parse the date string as an ISO 8601 formatted string. If that fails, it will attempt to parse the date string as a date string with separators.

In this process it is sometimes unknowable whether 01-02-2024 is January 2nd or February 1st, so the function uses the default_first parameter to decide. If default_first is set to 'day', the function assumes that the day comes first; if it is set to 'month', the function assumes that the month comes first.

Parameters:

date_str – the date string to parse
default_first – the default unit to use if it is unclear which unit comes first between day and month
compliance – the compliance level of the parser

Returns:

the Date object created from the date string
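
The day/month ambiguity handled by default_first can be illustrated with a small stand-alone parser. parse_ambiguous_date is a hypothetical function that returns a standard datetime.date rather than the library's Date, supports only 'A-B-YYYY' style strings, and does not reproduce the ISO 8601 attempt or the compliance handling of the real method.

```python
import re
from datetime import date
from typing import Literal


def parse_ambiguous_date(
    date_str: str, default_first: Literal["day", "month"] = "day"
) -> date:
    """Parse 'A-B-YYYY' where the separators may be '-', '/' or '.'."""
    match = re.fullmatch(r"(\d{1,2})[-/.](\d{1,2})[-/.](\d{4})", date_str.strip())
    if match is None:
        raise ValueError(f"Unsupported date format: {date_str!r}")
    first, second, year = (int(group) for group in match.groups())
    if first > 12:
        day, month = first, second   # a value above 12 can only be the day
    elif second > 12:
        day, month = second, first
    elif default_first == "day":
        day, month = first, second   # genuinely ambiguous: use the default
    else:
        day, month = second, first
    return date(year, month, day)


print(parse_ambiguous_date("01-02-2024"))
print(parse_ambiguous_date("01-02-2024", default_first="month"))
```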

class phenopacket_mapper.data_standards.ValueSet(elements: Tuple[Coding | CodeableConcept | CodeSystem | str | bool | int | float | Date | type, ...] = <factory>, name: str = '', description: str = '', _resources: Tuple[CodeSystem, ...] = <factory>)[source]

Bases: object

Defines a set of values that can be used in a DataField

A value set defines the viable values for a DataField. It can be a list of values, codings, codeable concepts, dates, etc. Also, it can just list types or CodeSystems that are allowed for a DataField.

Example use cases:

- True, False, or Unknown
- only allow strings
- allow any numerical value (i.e., int, float)
- allow any date
- allow any code from one or more CodeSystems
- allow only a specific set of codings
- etc.

By assigning a ValueSet to a DataField, we can define the possible values for that field. This has multiple benefits: it allows for validation of the data, it facilitates the computability of the data, and it allows for better interoperability between different systems.
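
How a value set that mixes concrete values and types can be used for validation is sketched below on plain tuples. in_value_set is a hypothetical helper, and the rule that booleans do not count as numbers is a design choice of this sketch; the library's own membership logic is not shown and may differ.

```python
from typing import Any, Tuple


def in_value_set(value: Any, elements: Tuple[Any, ...]) -> bool:
    """Return True if `value` equals a listed value or matches a listed type."""
    for element in elements:
        if isinstance(element, type):
            # bool is a subclass of int; do not let True/False pass as numbers.
            if isinstance(value, bool) and element is not bool:
                continue
            if isinstance(value, element):
                return True
        elif type(value) is type(element) and value == element:
            return True
    return False


yes_no_unknown = (True, False, "Unknown")  # "True, False, or Unknown"
numeric = (int, float)                     # "allow any numerical value"

print(in_value_set("Unknown", yes_no_unknown))
print(in_value_set("maybe", yes_no_unknown))
print(in_value_set(3.5, numeric))
```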

Variables:

elements – List of elements that define the value set
name – Name of the value set
description – Description of the value set

elements: Tuple[Coding | CodeableConcept | CodeSystem | str | bool | int | float | Date | type, ...]

name: str

description: str

extend(new_name: str, value_set: ValueSet, new_description: str = '') → ValueSet[source]

remove_duplicates() → ValueSet[source]

property resources: Tuple[CodeSystem, ...]

Returns the resources if they exist, otherwise provides a default empty list.

static parse_value_set(value_set_str: str, value_set_name: str = '', value_set_description: str = '', resources: List[CodeSystem] = None, compliance: Literal['strict', 'lenient'] = 'lenient') → ValueSet[source]

Parses a value set from a string representation

Parameters:

value_set_str – String representation of the value set
value_set_name – Name of the value set
value_set_description – Description of the value set
resources – List of CodeSystems to use for parsing the value set
compliance – Compliance level for parsing the value set

Returns:

A ValueSet object as defined by the string representation

Subpackages

phenopacket_mapper.data_standards.data_models package
- Submodules
