Transforming Data¶

Overview¶

SolveBio makes it easy to transform data using its dynamic, Python-based expression language. You can use expressions to transform data when importing files, when copying data between (or within) datasets (using "migrations"), or even when querying datasets. In all scenarios, expressions can be provided through the target_fields parameter. To learn more about expressions, please visit the Expressions section.

This article describes how to transform data using dataset migrations, but you can use the same techniques with dataset imports. With dataset migrations, you can copy data between datasets as well as modify datasets in-place. This makes it possible to add, edit, and remove fields. All dataset migrations have a source dataset and a target dataset (which can be the same when editing a single dataset).

Copy a Dataset¶

Copying a dataset is the simplest dataset migration. In this case, a new dataset is created (the target) and all the records from a source dataset are copied to it. The source dataset remains unchanged. Copying datasets can be useful if you do not want to alter the source dataset, or if you do not have write access to the source dataset.

If you do not want to copy the entire source dataset, you can provide filter parameters to copy a subset. This example copies all BRCA1 variants from ClinVar into a new dataset:

Python R

from solvebio import Dataset

# Retrieve the source dataset
source = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37')
# Create your new target dataset
target = Dataset.get_or_create_by_full_path('~/python_examples/clinvar')

# Copy all variants in BRCA1
query = source.query().filter(gene_symbol='BRCA1')
query.migrate(target=target)

library(solvebio)

# Retrieve the source dataset
source_dataset <- Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37')

# Create your new target dataset
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/clinvar", sep=":")
target_dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Copy all variants in BRCA1
migration <- DatasetMigration.create(
    source_id=source_dataset$id,
    target_id=target_dataset$id,
    # Omit source_params to copy the whole dataset
    source_params=list(
        filters=list(
            list('gene_symbol', 'BRCA1')
        )
    )
)

Add Fields¶

The most common dataset transformation is to add a field to a dataset (also known as annotating the dataset or inserting a column). Fields can be added or modified using the target_fields parameter, which should contain a list of valid dataset fields. Any new fields in target_fields will be automatically detected and added to the dataset's schema. Adding fields requires the use of the upsert or overwrite commit mode, depending on the desired effect. This will ensure that the records are updated in-place (based on their _id value), and not duplicated.

A collection of fields is a Dataset Template

To add multiple fields and transform data in a specific way, use a Dataset Template

In this example, a new field will is added to a dataset "in-place", using upsert commit mode.

This example follows from the previous one

To run this example, make sure you have created your own copy of ClinVar.

Python R

from solvebio import Dataset

dataset = Dataset.get_or_create_by_full_path('~/python_examples/clinvar')

fields = [
    {
        'name': 'clinsig_clone',
        'expression': 'record.clinical_significance',
        'description': 'A copy of the current clinical_significance value',
        'data_type': 'string'
    }
]

# The source and target are the same dataset, which edits the dataset in-place
dataset.migrate(target=dataset, target_fields=fields, commit_mode='upsert')

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/clinvar", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

fields = list(
    list(
        name='clinsig_clone',
        expression='record.clinical_significance',
        data_type='string'
    )
)

# The source and target are the same dataset, which edits the dataset in-place
DatasetMigration.create(
    source_id=dataset$id,
    target_id=dataset$id,
    target_fields=fields,
    source_params=list(limit=100000),
    commit_mode='upsert'
)

Recipes¶

A field can also be added using a recipe, which is the UI equivalent of inserting a column.

Recipe modal

Preset recipes are available for use for commonly created fields. A custom recipe can also be built using a blank recipe and constructing the field and expression manually.

Adding a column that already exists will overwrite data

Note that if the column that is specified already exists in the dataset, data will be overwritten.

Parallel Mode (beta)

NOTE: This feature is currently in testing and only available to some users.

In the UI, an additional integer value called parallel can also be set to split a single migration into multiple parallel tasks, attempting to improve run time. While the value defaults to 1 (run a single task), any integer between 1 and 50 can be chosen.

Edit Fields¶

In this example, an existing field is modified (converted to uppercase). Similar to the example above, a commit mode of overwrite or upsert must be used here to avoid duplicating records.

This example uses an expression that references a pre-existing field in the dataset (learn more about expression context).

This example follows from the previous one

To run this example, make sure you have created your own copy of ClinVar with the new field added.

Python R

from solvebio import Dataset

dataset = Dataset.get_or_create_by_full_path('~/python_examples/clinvar')

fields = [
    {
        # Convert your copied "clinical_significance" values to uppercase.
        'name': 'clinsig_clone',
        'data_type': 'string',
        'expression': 'value.upper()'
    }
]

# The source and target are the same dataset, which edits the dataset in-place
dataset.migrate(target=dataset, target_fields=fields, commit_mode='upsert')

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/clinvar", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

fields <- list(
    list(
        # Convert your copied "clinsig_clone" values to uppercase.
        name= 'clinsig_clone',
        data_type='string',
        expression='value.upper()'
    )
)

# Run a migration where the source and target are the same dataset
DatasetMigration.create(
    source_id=dataset$id,
    target_id=dataset$id,
    target_fields=fields,
    commit_mode='upsert'
)

Remove Fields¶

Removing a field requires a new empty target dataset

Since you cannot remove a field from a dataset in-place, all the data must be migrated to a new target dataset with the removed field(s) excluded from the source dataset.

In this example, the field (clinsig_clone) in the source dataset is removed by the dataset migration. Since we are removing a field, the target dataset must be a new dataset (or one without the field). In this scenario, any commit mode can be used unless you intend to overwrite records in the target dataset.

This example follows from the previous one

To run this example, make sure you have created your own copy of ClinVar with the new field added.

Python R

from solvebio import Dataset

# Use the dataset from the example above
source = Dataset.get_or_create_by_full_path('~/python_examples/clinvar')

# To remove a field, you need to create a new, empty dataset first.
target = Dataset.get_or_create_by_full_path('~/python_examples/clinvar_lite')

# Exclude the copied field from the example above
query = source.query(exclude_fields=['clinsig_clone'])
query.migrate(target=target, commit_mode='upsert')

library(solvebio)

vault <- Vault.get_personal_vault()

# Use the dataset from the example above
source_full_path <- paste(vault$full_path, "/r_examples/clinvar", sep=":")
source <- Dataset.get_by_full_path(source_full_path)

# To remove a field, you need to create a new, empty dataset first.
target_full_path <- paste(vault$full_path, "/r_examples/clinvar_lite", sep=":")
target <- Dataset.get_or_create_by_full_path(target_full_path)

# Exclude the copied field from the example above
DatasetMigration.create(
    source_id=source$id,
    target_id=target$id,
    source_params=list(
        exclude_fields=list('clinsig_clone')
    ),
    commit_mode='upsert'
)

If you just want to remove the data from a specific field, run an upsert migration and use an expression to set the values to None (Python's equivalent to NULL).

Transient Fields¶

Transient fields are like variables in a programming language. They can be used in a complex transform that requires intermediate values that you do not want to store in the dataset. Transient fields can be referenced by other expressions, but are not added to the dataset's schema or stored. Just add is_transient=True and ensure that the field's ordering evaluates the transient fields in the right order.

In the following example we will use transient fields to structure a few VCF records, leaving SolveBio variant IDs and dbSNP rsIDs in the resulting dataset:

Python R

from solvebio import Dataset

dataset = Dataset.get_or_create_by_full_path('~/python_examples/transient_test')

# Sample of a VCF file
vcf = """CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      T       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTCT   GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3
""".splitlines()
# Extract the records (without the header)
records = [{'row': row} for row in vcf[1:]]

target_fields = [
    {
        "name": "header",
        "is_transient": True,
        "ordering": 1,
        "data_type": "string",
        "is_list": True,
        "expression": vcf[0].split()
    },
    {
        "name": "row",
        "is_transient": True,
        "ordering": 1,
        "data_type": "object",
        "expression": "dict(zip(record.header, value.split()))"
    },
    {
        "name": "build",
        "is_transient": True,
        "ordering": 1,
        "data_type": "string",
        "expression": "'GRCh37'"
    },
    {
        "name": "variant",
        "ordering": 2,
        "data_type": "string",
        "entity_type": "variant",
        "expression": """
            '-'.join([
                record.build,
                record.row['CHROM'],
                record.row['POS'],
                record.row['POS'],
                record.row['ALT']
            ])
        """
    },
    {
        "name": "rsid",
        "data_type": "string",
        "ordering": 2,
        "expression": "get(record, 'row.ID') if get(record, 'row.ID') != '.' else None"
    }
]

imp = DatasetImport.create(
    dataset_id=dataset.id,
    records=records,
    target_fields=target_fields)

dataset.activity(follow=True)

for record in dataset.query(exclude_fields=['_id', '_commit']):
    print(record)

# {'rsid': 'rs6054257' , 'variant': 'GRCh37-20-14370-14370-A'}
# {'rsid': None        , 'variant': 'GRCh37-20-17330-17330-A'}
# {'rsid': 'rs6040355' , 'variant': 'GRCh37-20-1110696-1110696-G'}
# {'rsid': None        , 'variant': 'GRCh37-20-1230237-1230237-T'}
# {'rsid': 'microsat1' , 'variant': 'GRCh37-20-1234567-1234567-GTACT'}

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/variants-transient", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# translate_variant() returns an object with 8+ fields.
# since we do not want all those fields added to the dataset
# we make this field transient and pull the gene and protein change values
# into their own fields
target_fields <- list(
    list(name="translated_variant_transient",
         description="Transient fields that runs variant translation expression",
         is_transient=TRUE,
         data_type="object",
         ordering=1,
         expression="translate_variant(record.variant)"),
    list(name="gene",
         description="HUGO Gene Symbol",
         data_type="string",
         ordering=2,
         expression="get(record, 'translated_variant_transient.gene')"),
    list(name="protein_change",
         data_type="string",
         ordering=2,
         expression="get(record, 'translated_variant_transient.protein_change')")
)

# Import these records into a dataset
records <- list(
        list(variant='GRCH37-17-41244429-41244429-T'),
        list(variant='GRCH37-3-37089131-37089131-T')
)

# Add columns gene/protein_change, via transient column "translated_variant"
imp <- DatasetImport.create(dataset_id = dataset$id,
                            records = records,
                            commit_mode = 'upsert',
                            target_fields = target_fields)

Dataset.activity(dataset$id)

Dataset.query(id = dataset$id, exclude_fields=list('_id', '_commit'))
#      gene            variant                   protein_change
#   1  BRCA1  GRCH37-17-41244429-41244429-T       p.S1040N
#   2  MLH1   GRCH37-3-37089131-37089131-T        p.K618M

Transforming Queries¶

The output of regular dataset queries can be transformed using the target_fields parameter. This makes it possible to supplement the information returned from a query in real-time, without having to modify the underlying dataset.

The following example retrieves a number of variants from ClinVar and annotates the records with gnomAD population allele frequencies:

Python R

from solvebio import Dataset

target_fields = [{
    'name': 'gnomad_af',
    'data_type': 'float',
    'expression': """
        dataset_field_values(
            'solvebio:public:/gnomAD/2.1.1/Exomes-GRCh38',
            field='af',
            entities=[('variant', record.variant)])
    """
}]

dataset = Dataset.get_by_full_path('solvebio:public:/ClinVar/5.1.0-20200720/Variants-GRCH38')

query = dataset.query(target_fields=target_fields).filter(gene='BRCA2')
for row in query[:10]:
    print(row['variation_id'], row['variant'], row['gnomad_af'])

library(solvebio)

dataset <- Dataset.get_by_full_path("solvebio:public:/ClinVar/5.1.0-20200720/Variants-GRCH38")
filters <- '[["gene__exact","BRCA2"]]'
fields <- list(
    list(
        name='gnomad_af',
        data_type='float',
        expression='dataset_field_values("solvebio:public:/gnomAD/2.1.1/Exomes-GRCh38", field="af", entities=[("variant", record.variant)])'
    )
)
query <- Dataset.query(dataset$id, filters=filters, target_fields=fields, limit=10)

Modifying Datasets¶

Overwrite Records¶

In order to completely overwrite specific records in a dataset just use the overwrite commit mode. You'll need to know the _id of the records you wish to overwrite, but you can get that by querying the dataset.

In this example we will import a few records and then edit one of them:

Python R

from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/edit_records')

# Initial Import
imp = DatasetImport.create(
    dataset_id=dataset.id,
    records=[
        {'name': 'Francis Crick', 'birth_year': '1916'},
        {'name': 'James Watson', 'birth_year': '1928'},
        {'name': 'Rosalind Franklin', 'birth_year': '1920'}
    ]
)

# Retrieve the record to edit
record = dataset.query().filter(name='Francis Crick')[0]
record['name'] = 'Francis Harry Compton Crick'
record['awards'] = ['Order of Merit', 'Fellow of the Royal Society']

# Overwrite mode
imp = DatasetImport.create(
    dataset_id=dataset.id,
    records=[record],
    commit_mode='overwrite'
)

# Lookup record by ID and see the edited record
print(dataset.query().filter(_id=record['_id']))

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/transform_overwrite", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Initial records to import
records <- list(
        list(name='Francis Crick', birth_year='1916'),
        list(name='James Watson', birth_year='1928'),
        list(name='Rosalind Franklin', birth_year='1920')
)
imp <- DatasetImport.create(dataset_id = dataset$id,
                            records = records)

# Get record and change some data
record <- Dataset.query(id = dataset$id, filters='[["name", "Francis Crick"]]')[1,]
record['name'] = 'Francis Harry Compton Crick'
record['awards'] = list('Order of Merit', 'Fellow of the Royal Society')

# Overwrite mode
imp <- DatasetImport.create(dataset_id = dataset$id,
                            records = list(as.list(record)),
                            commit_mode = 'overwrite')

# Show the result
filters = list(
    list("_id", record$"_id")
)
Dataset.query(id = dataset$id, filters = filters)

Pro Tip

To clear all values in a field, try setting the expression to None and run the migration.

Upsert (Edit) Records¶

In order to only update (or add) specific fields in a dataset use the upsert commit mode. You'll need to know the _id of the records you wish to upsert, but you can get that by querying the dataset.

Similar to the example above, we will import a few records and then edit one of them:

Python R

from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/upsert_records')

# Initial Import
imp = DatasetImport.create(
    dataset_id=dataset.id,
    records=[
        {'name': 'Francis Crick', 'birth_year': '1916'},
        {'name': 'James Watson', 'birth_year': '1928'},
        {'name': 'Rosalind Franklin', 'birth_year': '1920'}
    ]
)

# Retrieve the record to edit
record = dataset.query().filter(name='Francis Crick')[0]
# Change existing field
record['name'] = 'Francis Harry Compton Crick'
# Add a new field
record['birthplace'] = 'Northampton, England'

# Upsert mode
imp = DatasetImport.create(
    dataset_id=dataset.id,
    records=[record],
    commit_mode='upsert'
)

# Lookup record by ID and see the edited record
print(dataset.query().filter(_id=record['_id']))

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/transform_upsert", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Initial records to import
records <- list(
        list(name='Francis Crick', birth_year='1916'),
        list(name='James Watson', birth_year='1928'),
        list(name='Rosalind Franklin', birth_year='1920')
)
imp <- DatasetImport.create(dataset_id = dataset$id,
                            records = records)

# Get record and change some data
record <- Dataset.query(id = dataset$id, filters='[["name", "Francis Crick"]]')[1,]
record['name'] = 'Francis Harry Compton Crick'
record['birthplace'] = 'Northampton, England'

# Upsert mode
imp <- DatasetImport.create(dataset_id = dataset$id,
                            records = list(as.list(record)),
                            commit_mode = 'upsert')

# Show the result
filters = list(
    list("_id", record$"_id")
)
Dataset.query(id = dataset$id, filters = filters)

Delete Records¶

In order to completely delete a record from a dataset you may use the delete commit mode, and pass a list of record IDs (from their _id field).

You can delete records via an import (if you have a file or list of record IDs) or via a migration (if you are deleting the results of a dataset query).

Delete via Import¶

Python R

from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/data_delete')

# Initial Import
imp = DatasetImport.create(
    dataset_id=dataset.id,
    records=[
        {'name': 'six'},
        {'name': 'seven'},
        {'name': 'eight'},
        {'name': 'nine'}
    ]
)

for r in dataset.query(fields=['name']):
    print(r)
# {'name': 'six'}
# {'name': 'seven'}
# {'name': 'eight'}
# {'name': 'nine'}

# Get the record ID for 'nine'
record = dataset.query(fields=['_id']).filter(name='nine')

# Delete the record
imp = DatasetImport.create(
    dataset_id=dataset.id,
    records=list(record),
    commit_mode='delete'
)

# Why was six afraid of seven?
for r in dataset.query(fields=['name']):
    print(r)
# {'name': 'six'}
# {'name': 'seven'}
# {'name': 'eight'}

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/data_delete", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Initial records to import
records <- list(
        list(name='six'),
        list(name='seven'),
        list(name='eight'),
        list(name='nine')
)
imp <- DatasetImport.create(dataset_id = dataset$id,
                            records = records)

Dataset.query(id = dataset$id, fields=list('name'))
#   name
#   1 six
#   2 seven
#   3 eight
#   4 nine

# Get the record ID for 'nine'
record <- Dataset.query(id = dataset$id, filters='[["name", "nine"]]', fields=list("_id"))

# Delete mode
imp <- DatasetImport.create(dataset_id = dataset$id,
                            records = list(as.list(record)),
                            commit_mode = 'delete')


Dataset.query(id = dataset$id, fields=list('name'))
#   name
#   1 six
#   2 seven
#   3 eight

Delete via Migration¶

Python R

from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/data_delete_migration')

# Initial Import
imp = DatasetImport.create(
    dataset_id=dataset.id,
    records=[
        {'name': 'Alice'},
        {'name': 'Bob'},
        {'name': 'Carol'},
        {'name': 'Chuck'},
        {'name': 'Craig'},
        {'name': 'Dan'},
        {'name': 'Eve'}
    ]
)

# Get the records where names begin with C
query = dataset.query().filter(name__prefix='C')

# Use migration shortcut to Delete
migration = query.migrate(dataset, commit_mode="delete")

# The above shortcut is equivalent to:
# migration = DatasetMigration.create(
#    source_id=dataset.id,
#    target_id=dataset.id,
#    commit_mode="delete",
#    source_params=dict(filters=[('name__prefix', 'C')])
# )

for r in dataset.query():
    print(r)

# {'name': 'Alice'},
# {'name': 'Bob'},
# {'name': 'Dan'},
# {'name': 'Eve'}

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/data_delete_migration", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Initial records to import
records <- list(
        list(name='Alice'),
        list(name='Bob'),
        list(name='Carol'),
        list(name='Chuck'),
        list(name='Craig'),
        list(name='Dan'),
        list(name='Eve')
)
imp <- DatasetImport.create(dataset_id = dataset$id,
                            records = records)

# Get the records where names begin with C
records <- Dataset.query(id = dataset$id, filters='[["name__prefix", "c"]]')

# Delete mode
migration <- DatasetMigration.create(
        source_id = dataset$id,
        target_id = dataset$id,
        source_params=list(filters=list(list("name__prefix", "C"))),
        commit_mode = 'delete')

Dataset.query(id = dataset$id, fields=list('name'))
#   name
#   1 Alice
#   2 Bob
#   3 Dan
#   4 Eve

Last updated 2022-12-07.

Have questions or comments about this article? Get in touch with SolveBio Support by submitting a ticket or by sending us an email.