
Dataset Templates

Overview

Templates describe how data should be transformed. A template is a collection of fields (columns) that describe the desired format of some input data. Templates are used to import files and to export, query, or migrate data. They allow for field normalization and transformation, and for adding extra fields and annotations.

Retrieving Templates

To list all templates:

Python:
for template in DatasetTemplate.all(template_type="dataset"):
    print(template['name'], template['id'], template['account'], template['is_public'])

R:
all_templates = DatasetTemplate.all(template_type="dataset")

This returns all available dataset templates, including each template's name, ID, organization account ID, and public status (is_public).

To retrieve a template by known ID:

Python:
template = solvebio.DatasetTemplate.retrieve('template_id')
print(template)

R:
template = DatasetTemplate.retrieve(id="template_id")

See the DatasetTemplate reference for documentation on all of the parameters.

The list of fields is the most important part of a template. Each entry in the list describes a DatasetField.

Example:

Python:
fields = [
{
    "name": "reason",
    "title": "Reason",
    "description": "The reasons for the significance value",
    "data_type": "string",
    "depends_on": [
        "reason_list"
    ],
    "expression": "', '.join(record.reason_list) if record.reason_list else None",
    "ordering": 1,
},
{...}
]

R:
fields <- list(
    list(
        name="reason",
        title="Reason",
        description="The reasons for the significance value",
        data_type="string",
        depends_on=list("reason_list"),
        expression="', '.join(record.reason_list) if record.reason_list else None",
        ordering=1
    ),
    list(...)
)

Create a template

To create a template, prepare the list of DatasetFields with information about data types, expressions, entities, etc.

Read more about expressions: Expression functions

Example list of fields:

Python:
fields = [
{
    'name': 'sample',
    "depends_on": ['subject'],
    "entity_type": "sample",
    'description': 'Sample ID from SUBJECT',
    'data_type': 'string',
    'ordering': 1,
    'expression': "record.subject"
},
{
    'name': 'study',
    'title': 'STUDY',
    'description': 'Study Code',
    'ordering': 2,
    'expression': "None if value == 'UNASSIGNED' else value",
    'data_type': 'string'
},
{
    "data_type": "string",
    "depends_on": [
        "hgvs_c"
    ],
    "description": "SolveBio variant entity, computed from the short variant CDS change",
    "expression": "entity_ids('variant', record.hgvs_c) if record.hgvs_c else None",
    "is_transient": True,
    "name": "variant_cdna_grch38"
}
]

R:
fields <- list(
    list(
        name='sample',
        depends_on=list('subject'),
        entity_type="sample",
        description='Sample ID from SUBJECT',
        data_type='string',
        ordering=1,
        expression="record.subject"
    ),
    list(
        name='study',
        title='STUDY',
        description='Study Code',
        ordering=2,
        expression="None if value == 'UNASSIGNED' else value",
        data_type='string'
    ),
    list(
        data_type="string",
        depends_on=list("hgvs_c"),
        description="SolveBio variant entity, computed from the short variant CDS change",
        expression="entity_ids('variant', record.hgvs_c) if record.hgvs_c else None",
        is_transient=TRUE,
        name="variant_cdna_grch38"
    )
)

Each field should include the following attributes:

  • name - the name of the field
  • data_type - the data type of the field
  • entity_type - the entity type (only necessary for entity querying)
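
Before creating a template, it can help to check locally that every field carries the attributes listed above. The following is a minimal sketch in plain Python; `missing_attributes` and `REQUIRED_ATTRIBUTES` are hypothetical names, not part of the solvebio client, and `entity_type` is left out of the check because it is only needed for entity querying.

```python
# Attributes every field is expected to carry (assumption based on the list above).
REQUIRED_ATTRIBUTES = ("name", "data_type")


def missing_attributes(fields, required=REQUIRED_ATTRIBUTES):
    """Return {field label: [missing attribute names]} for incomplete fields.

    Hypothetical helper for illustration only.
    """
    problems = {}
    for position, field in enumerate(fields):
        missing = [attr for attr in required if not field.get(attr)]
        if missing:
            # Fall back to the list position when the field has no name.
            label = field.get("name") or "field #{}".format(position)
            problems[label] = missing
    return problems


example_fields = [
    {"name": "sample", "data_type": "string", "entity_type": "sample"},
    {"name": "study"},        # data_type is missing
    {"data_type": "string"},  # name is missing
]

problems = missing_attributes(example_fields)
print(problems)  # {'study': ['data_type'], 'field #2': ['name']}
```

An empty result means every field has at least a name and a data type.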

The following attributes are optional, but responsible for much of the data transformation:

  • expression - The expression that is evaluated to populate this field's value. Use "value" to refer to the field's current value. See the Expressions documentation. Note: to use data from another field (for comparisons, splits, etc.), make sure that field is also in the list of fields; you can then reference it as record.name_of_field. Do not forget to add it to the depends_on list.
  • depends_on - A list of the fields that your expression depends on. Add any referenced field names here. This ensures that those fields' expressions are evaluated before their dependents. Template creation will fail if there is a circular dependency.
  • is_transient - A transient field is not indexed into the dataset; it is calculated only while the template annotation is running. This is useful for temporary fields/variables in complex templates (default is False).
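
The effect of depends_on can be pictured as a topological sort over the fields. The sketch below is a local illustration only: it uses Python's standard `graphlib` module (Python 3.9+) rather than anything from SolveBio, and `evaluation_order` and the `variant_label` field are hypothetical. How SolveBio orders fields internally is not documented here, so treat this as a way to catch circular dependencies before template creation fails.

```python
from graphlib import CycleError, TopologicalSorter


def evaluation_order(fields):
    """Return field names so that every field comes after its dependencies.

    Hypothetical helper: names listed in depends_on that are not template
    fields (for example raw input columns) simply appear first in the result.
    """
    graph = {f["name"]: set(f.get("depends_on", [])) for f in fields}
    return list(TopologicalSorter(graph).static_order())


fields = [
    {"name": "variant_label", "depends_on": ["variant_cdna_grch38"]},
    {"name": "variant_cdna_grch38", "depends_on": ["hgvs_c"], "is_transient": True},
    {"name": "hgvs_c"},
]
order = evaluation_order(fields)
print(order)  # ['hgvs_c', 'variant_cdna_grch38', 'variant_label']

# Two fields that depend on each other cannot be ordered.
circular = [
    {"name": "a", "depends_on": ["b"]},
    {"name": "b", "depends_on": ["a"]},
]
try:
    evaluation_order(circular)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```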

Clean Templates create clean Datasets!

Field attributes are applied to the dataset when a field is used for the first time. It is best practice to add titles, descriptions, ordering and the is_hidden value to your fields. This helps to document your datasets and makes them easier to work with and understand.

The following attributes are optional, and informational only, but encouraged:

  • title - The field's display name, shown in the UI and in CSV/Excel exports.
  • description - Describes the contents of the field, shown in the UI.
  • ordering - The order in which this column appears when retrieving data from the dataset. Order is 0-based. Default is 0.
  • is_hidden - Set to True if the field should be excluded by default from the UI.
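
As a rough picture of how ordering and is_hidden shape what a user sees, the sketch below sorts a list of fields locally. It assumes that a missing ordering counts as 0 and that ties keep their original list order (Python's sort is stable); the tie-breaking SolveBio actually applies is not stated here. `visible_columns`, `internal_flag`, and `notes` are hypothetical names.

```python
def visible_columns(fields):
    """Names of non-hidden fields, sorted by `ordering` (missing ordering -> 0).

    Hypothetical helper for illustration only.
    """
    shown = [f for f in fields if not f.get("is_hidden", False)]
    shown.sort(key=lambda f: f.get("ordering", 0))
    return [f["name"] for f in shown]


fields = [
    {"name": "study", "ordering": 2},
    {"name": "internal_flag", "ordering": 0, "is_hidden": True},
    {"name": "sample", "ordering": 1},
    {"name": "notes"},  # no ordering given, treated as 0
]
columns = visible_columns(fields)
print(columns)  # ['notes', 'sample', 'study']
```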

Once the list of fields is prepared, add the remaining template information, for example:

Python:
template = {
    "name": "My Variant Template",
    "version": '1.2.0',
    "description": 'Import a special CSV file. Genome is assumed to be GRCh38, also has variant entity for GRCh37.',
    "template_type": "dataset",
    "is_public": False,
    "entity_params": {
        'disable': True
    },
    "fields": fields
}

R:
template <- list(
    name="My Variant Template",
    version='1.2.0',
    description='Import a special CSV file. Genome is assumed to be GRCh38, also has variant entity for GRCh37.',
    template_type="dataset",
    is_public=FALSE,
    entity_params=list('disable'=TRUE),
    fields=fields
)

Notes

Always set template_type to dataset! Do NOT set is_public to True: doing so gives all SolveBio users access to the template.

If you want the template to be shown in the UI (in the modal used for transforming files), add the following tag to the template:

Python:
"tags": ['import']

R:
tags <- list("import")
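
The notes above can be turned into a small local guard that runs on the template dictionary before it is sent anywhere. This is only a sketch in plain Python: `check_template` and `show_in_import_modal` are hypothetical helpers, not solvebio functions, and they make no API calls.

```python
def check_template(template):
    """Raise ValueError if the template dict breaks the notes above."""
    if template.get("template_type") != "dataset":
        raise ValueError("template_type must be 'dataset'")
    if template.get("is_public", False):
        raise ValueError("is_public must not be True")
    return template


def show_in_import_modal(template):
    """Return a copy of the template that carries the 'import' tag."""
    tagged = dict(template)
    tags = list(tagged.get("tags", []))
    if "import" not in tags:
        tags.append("import")
    tagged["tags"] = tags
    return tagged


template = {
    "name": "My Variant Template",
    "template_type": "dataset",
    "is_public": False,
    "fields": [],
}
ready = show_in_import_modal(check_template(template))
print(ready["tags"])  # ['import']
```

The original dictionary is left untouched, so the tagged copy can be inspected before it is used.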

After that, create the template:

Python:
my_template = solvebio.DatasetTemplate.create(**template)
print(my_template)

R:
my_template = DatasetTemplate.create(
    name="My Variant Template",
    version='1.2.0',
    description='Import a special CSV file. Genome is assumed to be GRCh38, also has variant entity for GRCh37.',
    template_type="dataset",
    is_public=FALSE,
    fields=fields)

Printing the template object will show the template's ID and contents.

Create a dataset with a template

You can create a dataset and set its structure with a template:

Python:
template = DatasetTemplate.retrieve('id of your template')
dataset = Dataset.get_or_create_by_full_path('your dataset path', fields=template.fields)

# Dataset will now have the non-transient fields from the template
# with desired titles/descriptions and expressions
print(dataset.fields())

# But no records
print(dataset.documents_count)

R:
template = DatasetTemplate.retrieve('id of your template')
# Specify where you want to create your new dataset
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "my_fields_dataset", sep="/")

# Dataset will now have the non-transient fields from the template
# with desired titles/descriptions and expressions, but no records
dataset = Dataset.get_or_create_by_full_path(dataset_full_path, fields=template$fields)
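
To predict which fields should appear on the new dataset, the transient ones can be filtered out of the template's field list. The sketch below works on plain dictionaries shaped like the field examples earlier on this page; `dataset_field_names` is a hypothetical helper, and is_transient is assumed to default to False as described above.

```python
def dataset_field_names(template_fields):
    """Names of the fields expected on the dataset (transient ones excluded).

    Hypothetical helper for illustration only.
    """
    return [f["name"] for f in template_fields if not f.get("is_transient", False)]


template_fields = [
    {"name": "sample", "data_type": "string"},
    {"name": "study", "data_type": "string"},
    {"name": "variant_cdna_grch38", "data_type": "string", "is_transient": True},
]
expected = dataset_field_names(template_fields)
print(expected)  # ['sample', 'study']
```

The resulting list can then be compared by eye with the field names reported for the dataset.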

You can also create a dataset and add the fields during file import:

Python:
template = DatasetTemplate.retrieve('id of your template')
dataset = Dataset.get_or_create_by_full_path('your dataset path')

# Only field should be "id"
print(dataset.fields())

file_object = Object.retrieve('id of file uploaded to SolveBio')
DatasetImport.create(
    dataset_id=dataset.id,
    object_id=file_object.id,
    **template.import_params,
    commit_mode='append',
)
# Wait for import to finish
dataset.activity(follow=True)

# Should now see all the non-transient fields from the template!
print(dataset.fields())

R:
template = DatasetTemplate.retrieve('id of your template')
# Specify where you want to create your new dataset
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "my_fields_dataset", sep="/")

dataset = Dataset.get_or_create_by_full_path(dataset_full_path, fields=template$fields)

file_object = Object.retrieve('id of file uploaded to SolveBio')
DatasetImport.create(
    dataset_id=dataset$id,
    object_id=file_object$id,
    template_params$import_params,
    commit_mode='append'
)
# Wait for import to finish
Dataset.activity(dataset$id, follow=TRUE)

Building and Testing templates with the annotator

When creating new templates, it is useful to use the annotator to test and validate the fields and their expressions. The snippet below uses the annotator to process records in real time with your template fields.

Python:
from solvebio import Annotator
from solvebio import Dataset
from solvebio import DatasetTemplate

# Load fields from a template
dataset = Dataset.get_by_full_path('vault:/my/dataset/')
template = DatasetTemplate.retrieve(template_id)

# Retrieve and annotate records with the dataset template fields
records = dataset.query()
for record in records.annotate(template.fields):
    print(record)

# Annotate records server side (most efficient)
records = dataset.query(target_fields=template.fields)

# Use the Annotator class
ann = Annotator(fields=template.fields)
records = dataset.query()
for record in ann.annotate(records):
    print(record)

R:
require(solvebio)

# Get records from dataset
records = Dataset.get_or_create_by_full_path(dataset_full_path).query()

# Load fields from a template
template <- DatasetTemplate.retrieve(template_id)
fields = template$fields

Annotator.annotate(records=records, fields=fields)
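
For very quick checks of a single expression, before involving the annotator at all, a table of input/expected pairs can be evaluated locally. The sketch below is an assumption-heavy stand-in: it treats the expression as an ordinary Python expression with `record` and `value` in scope and uses `eval`, whereas SolveBio's expression engine runs server side and offers functions (such as entity_ids) that are not available here. `evaluate_field` is a hypothetical helper.

```python
from types import SimpleNamespace

# The "study" field from the example list of fields on this page.
study_field = {
    "name": "study",
    "data_type": "string",
    "expression": "None if value == 'UNASSIGNED' else value",
}


def evaluate_field(field, record):
    """Evaluate one field's expression with `record` and `value` in scope.

    Hypothetical local stand-in for the annotator, for quick checks only.
    """
    scope = {
        "record": SimpleNamespace(**record),
        "value": record.get(field["name"]),
    }
    return eval(field["expression"], {"__builtins__": {}}, scope)


# (input record, expected output) pairs for the field under test.
cases = [
    ({"study": "UNASSIGNED"}, None),
    ({"study": "STUDY-001"}, "STUDY-001"),
]
results = [evaluate_field(study_field, record) for record, _ in cases]
failures = [
    (record, expected, actual)
    for (record, expected), actual in zip(cases, results)
    if actual != expected
]
print(failures)  # []
```

Any expression that passes this kind of local check should still be confirmed with the annotator, since only the annotator reflects the real evaluation environment.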

Next steps: Learn more about using templates to transform datasets