Transforming Data

Overview

SolveBio makes it easy to transform data using its dynamic, Python-based expression language. You can use expressions to transform data when importing files, or when working with datasets (through "dataset migrations"). In both scenarios, expressions can be provided through the target_fields parameter.

This article describes how to transform data using dataset migrations, but you can use the same techniques with dataset imports. With dataset migrations, you can copy data between datasets as well as modify datasets in-place. This makes it possible to add, edit, and remove fields. All dataset migrations have a source dataset and a target dataset (which can be the same when editing a single dataset).
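
Each entry in target_fields is a field definition with an optional expression. As a minimal, illustrative sketch (the gene and gene_upper field names here are hypothetical):

# One target_fields entry: add a "gene_upper" field computed from an
# existing "gene" field (both field names are hypothetical).
target_fields = [
    {
        'name': 'gene_upper',
        'data_type': 'string',
        'expression': 'record.gene.upper()'
    }
]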

Copy a Dataset

Copying a dataset is the simplest dataset migration. In this case, a new dataset is created (the target) and all the records from a source dataset are copied to it. The source dataset remains unchanged. Copying datasets can be useful if you do not want to alter the source dataset, or if you do not have write access to the source dataset.

If you do not want to copy the entire source dataset, you can provide filter parameters to copy a subset. This example copies all BRCA1 variants from ClinVar into a new dataset:

Python:
from solvebio import Dataset

# Retrieve the source dataset
source = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37')
# Create your new target dataset
target = Dataset.get_or_create_by_full_path('~/python_examples/clinvar_copy')

# Copy all variants in BRCA1
query = source.query().filter(gene_symbol='BRCA1')
query.migrate(target=target)

R:

library(solvebio)

# Retrieve the source dataset
source_dataset <- Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37')

# Create your new target dataset
vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/clinvar_copy", sep=":")
target_dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Copy all variants in BRCA1
migration <- DatasetMigration.create(
    source_id=source_dataset$id,
    target_id=target_dataset$id,
    # Omit source_params to copy the whole dataset
    source_params=list(
        filters=list(
            list('gene_symbol', 'BRCA1')
        )
    )
)

Add Fields

The most common dataset transformation is to add a field to a dataset (also known as annotating the dataset). Fields can be added or modified using the target_fields parameter, which should contain a list of valid dataset fields. Any new fields in target_fields are automatically detected and added to the dataset's schema. Adding fields requires the upsert or overwrite commit mode (depending on the desired effect); either mode ensures that records are updated in-place (matched on their _id value) rather than duplicated.

In this example, a new field is added to a dataset "in-place", using upsert mode.

This example follows from the previous one

To run this example, make sure you have created your own copy of ClinVar.

Python:
from solvebio import Dataset

dataset = Dataset.get_or_create_by_full_path('~/python_examples/clinvar_copy')

fields = [
    {
        'name': 'clinical_significance_copy',
        'expression': 'record.clinical_significance',
        'data_type': 'string'
    }
]

# Run a migration where the source and target are the same
dataset.migrate(target=dataset, target_fields=fields, commit_mode='upsert')

R:

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/clinvar_copy", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

fields <- list(
    list(
        name='clinical_significance_copy',
        expression='record.clinical_significance',
        data_type='string'
    )
)

# The source and target are the same dataset, which edits the dataset in-place
DatasetMigration.create(
    source_id=dataset$id,
    target_id=dataset$id,
    target_fields=fields,
    # Optionally limit the number of records migrated
    source_params=list(limit=100000),
    commit_mode='upsert'
)

Edit Fields

In this example, an existing field is modified (its values are converted to uppercase). As in the example above, a commit mode of overwrite or upsert must be used to avoid duplicating records.

This example uses an expression that references a pre-existing field in the dataset (learn more about expression context).

This example follows from the previous one

To run this example, make sure you have created your own copy of ClinVar with the new field added.

Python:
from solvebio import Dataset

dataset = Dataset.get_or_create_by_full_path('~/python_examples/clinvar_copy')

fields = [
    {
        # Convert your copied "clinical_significance" values to uppercase.
        'name': 'clinical_significance_copy',
        'data_type': 'string',
        'expression': 'value.upper()'
    }
]

# Run a migration where the source and target are the same dataset
dataset.migrate(target=dataset, target_fields=fields, commit_mode='upsert')

R:

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/clinvar_copy", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

fields <- list(
    list(
        # Convert your copied "clinical_significance_copy" values to uppercase.
        name='clinical_significance_copy',
        data_type='string',
        expression='value.upper()'
    )
)

DatasetMigration.create(
    source_id=dataset$id,
    target_id=dataset$id,
    target_fields=fields,
    commit_mode='upsert'
)

Remove Fields

Removing a field requires a new empty target dataset

Since you cannot remove a field from a dataset in-place, all the data must be migrated to a new target dataset, with the removed field(s) excluded from the source query.

In this example, the clinical_significance_copy field from the source dataset is removed by the dataset migration. Since a field is being removed, the target dataset must be a new dataset (or one that does not contain the field). In this scenario any commit mode can be used, unless you intend to overwrite existing records in the target dataset.

This example follows from the previous one

To run this example, make sure you have created your own copy of ClinVar with the new field added.

Python:
from solvebio import Dataset

# Use the dataset from the example above
source = Dataset.get_or_create_by_full_path('~/python_examples/clinvar_copy')

# To remove a field, you need to create a new, empty dataset first.
target = Dataset.get_or_create_by_full_path('~/python_examples/clinvar_copy_2')

# Exclude the copied field from the example above
query = source.query(exclude_fields=['clinical_significance_copy'])
query.migrate(target=target, commit_mode='upsert')

R:

library(solvebio)

vault <- Vault.get_personal_vault()

# Use the dataset from the example above
source_full_path <- paste(vault$full_path, "/r_examples/clinvar_copy", sep=":")
source <- Dataset.get_by_full_path(source_full_path)

# To remove a field, you need to create a new, empty dataset first.
target_full_path <- paste(vault$full_path, "/r_examples/clinvar_copy_2", sep=":")
target <- Dataset.get_or_create_by_full_path(target_full_path)

# Exclude the copied field from the example above
DatasetMigration.create(
    source_id=source$id,
    target_id=target$id,
    source_params=list(
        exclude_fields=list('clinical_significance_copy')
    ),
    commit_mode='upsert'
)

If you just want to remove the data from a specific field (rather than the field itself), run an upsert migration and use an expression to set the values to None (Python's equivalent of NULL).
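
As a minimal sketch, here is a migration that blanks out the clinical_significance_copy field from the earlier examples (assuming your copy of ClinVar exists):

from solvebio import Dataset

dataset = Dataset.get_or_create_by_full_path('~/python_examples/clinvar_copy')

# Setting the expression to None clears the stored values
# while keeping the field in the dataset's schema.
fields = [
    {
        'name': 'clinical_significance_copy',
        'data_type': 'string',
        'expression': 'None'
    }
]

dataset.migrate(target=dataset, target_fields=fields, commit_mode='upsert')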

Transient Fields

Transient fields are like variables in a programming language: they can hold intermediate values that a complex transform needs but that you do not want to store in the dataset. Transient fields can be referenced by other expressions, but they are not added to the dataset's schema or stored. Just add is_transient=True and set each field's ordering so that transient fields are evaluated before the expressions that reference them.
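
As a minimal sketch (with hypothetical field names), a transient field and a stored field that references it look like this:

target_fields = [
    {
        'name': 'raw_line',
        'is_transient': True,   # evaluated, but not stored in the schema
        'ordering': 1,          # evaluate before fields that reference it
        'data_type': 'string',
        'expression': 'record.line.strip()'  # "line" is a hypothetical input field
    },
    {
        'name': 'first_token',
        'ordering': 2,
        'data_type': 'string',
        'expression': 'record.raw_line.split()[0]'
    }
]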

In the following example, transient fields are used to parse a few VCF records, keeping only SolveBio variant IDs and dbSNP rsIDs in the resulting dataset:

Python:
from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/transient_test')

# Sample of a VCF file
vcf = """CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      T       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTCT   GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3
""".splitlines()
# Extract the records (without the header)
records = [{'row': row} for row in vcf[1:]]

target_fields = [
    {
        "name": "header",
        "is_transient": True,
        "ordering": 1,
        "data_type": "string",
        "is_list": True,
        "expression": vcf[0].split()
    },
    {
        "name": "row",
        "is_transient": True,
        "ordering": 1,
        "data_type": "object",
        "expression": "dict(zip(record.header, value.split()))"
    },
    {
        "name": "build",
        "is_transient": True,
        "ordering": 1,
        "data_type": "string",
        "expression": "'GRCh37'"
    },
    {
        "name": "variant",
        "ordering": 2,
        "data_type": "string",
        "entity_type": "variant",
        "expression": """
            '-'.join([
                record.build,
                record.row['CHROM'],
                record.row['POS'],
                record.row['POS'],
                record.row['ALT']
            ])
        """
    },
    {
        "name": "rsid",
        "data_type": "string",
        "ordering": 2,
        "expression": "get(record, 'row.ID') if get(record, 'row.ID') != '.' else None"
    }
]

imp = DatasetImport.create(
    dataset_id=dataset.id,
    data_records=records,
    target_fields=target_fields)

dataset.activity(follow=True)

for record in dataset.query(exclude_fields=['_id', '_commit']):
    print(record)

# {'rsid': 'rs6054257' , 'variant': 'GRCh37-20-14370-14370-A'}
# {'rsid': None        , 'variant': 'GRCh37-20-17330-17330-A'}
# {'rsid': 'rs6040355' , 'variant': 'GRCh37-20-1110696-1110696-G'}
# {'rsid': None        , 'variant': 'GRCh37-20-1230237-1230237-T'}
# {'rsid': 'microsat1' , 'variant': 'GRCh37-20-1234567-1234567-GTACT'}

R:

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/variants-transient", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# translate_variant() returns an object with 8+ fields.
# Since we do not want all of those fields added to the dataset,
# we make this field transient and pull the gene and protein
# change values into their own fields.
target_fields <- list(
    list(name="translated_variant_transient",
         description="Transient fields that runs variant translation expression",
         is_transient=TRUE,
         data_type="object",
         ordering=1,
         expression="translate_variant(record.variant)"),
    list(name="gene",
         description="HUGO Gene Symbol",
         data_type="string",
         ordering=2,
         expression="get(record, 'translated_variant_transient.gene')"),
    list(name="protein_change",
         data_type="string",
         ordering=2,
         expression="get(record, 'translated_variant_transient.protein_change')")
)

# Import these records into a dataset
records <- list(
        list(variant='GRCH37-17-41244429-41244429-T'),
        list(variant='GRCH37-3-37089131-37089131-T')
)

# Add gene/protein_change columns via the transient field "translated_variant_transient"
imp <- DatasetImport.create(dataset_id = dataset$id,
                            data_records = records,
                            commit_mode = 'upsert',
                            target_fields = target_fields)

Dataset.activity(dataset$id)

Dataset.query(id = dataset$id, exclude_fields=list('_id', '_commit'))
#      gene            variant                   protein_change
#   1  BRCA1  GRCH37-17-41244429-41244429-T       p.S1040N
#   2  MLH1   GRCH37-3-37089131-37089131-T        p.K618M

Overwrite Records

To completely overwrite specific records in a dataset, use the overwrite commit mode. You'll need to know the _id of each record you wish to overwrite, which you can get by querying the dataset.

In this example we will import a few records and then edit one of them:

Python:
from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/edit_records')

# Initial Import
imp = DatasetImport.create(
    dataset_id=dataset.id,
    data_records=[
        {'name': 'Francis Crick', 'birth_year': '1916'},
        {'name': 'James Watson', 'birth_year': '1928'},
        {'name': 'Rosalind Franklin', 'birth_year': '1920'}
    ]
)

# Retrieve the record to edit
record = dataset.query().filter(name='Francis Crick')[0]
record['name'] = 'Francis Harry Compton Crick'
record['awards'] = ['Order of Merit', 'Fellow of the Royal Society']

# Overwrite mode
imp = DatasetImport.create(
    dataset_id=dataset.id,
    data_records=[record],
    commit_mode='overwrite'
)

# Look up the record by its ID to see the edited values
print(dataset.query().filter(_id=record['_id']))

R:

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/transform_overwrite", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Initial records to import
records <- list(
        list(name='Francis Crick', birth_year='1916'),
        list(name='James Watson', birth_year='1928'),
        list(name='Rosalind Franklin', birth_year='1920')
)
imp <- DatasetImport.create(dataset_id = dataset$id,
                            data_records = records)

# Get record and change some data
record <- Dataset.query(id = dataset$id, filters='[["name", "Francis Crick"]]')[1,]
record['name'] = 'Francis Harry Compton Crick'
record['awards'] = list('Order of Merit', 'Fellow of the Royal Society')

# Overwrite mode
imp <- DatasetImport.create(dataset_id = dataset$id,
                            data_records = list(as.list(record)),
                            commit_mode = 'overwrite')

# Show the result
filters = list(
    list("_id", record$"_id")
)
Dataset.query(id = dataset$id, filters = filters)

Pro Tip

To clear all values in a field, set the field's expression to None and run the migration.

Delete Records

To completely delete records from a dataset, use the delete commit mode and pass a list of records containing their _id values.

Python:
from solvebio import Dataset
from solvebio import DatasetImport

dataset = Dataset.get_or_create_by_full_path('~/python_examples/data_delete')

# Initial Import
imp = DatasetImport.create(
    dataset_id=dataset.id,
    data_records=[
        {'name': 'six'},
        {'name': 'seven'},
        {'name': 'eight'},
        {'name': 'nine'}
    ]
)

for r in dataset.query(fields=['name']):
    print(r)
# {'name': 'six'}
# {'name': 'seven'}
# {'name': 'eight'}
# {'name': 'nine'}

# Get the record ID for 'nine'
record = dataset.query(fields=['_id']).filter(name='nine')

# Delete the record
imp = DatasetImport.create(
    dataset_id=dataset.id,
    data_records=list(record),
    commit_mode='delete'
)

# Why was six afraid of seven?
for r in dataset.query(fields=['name']):
    print(r)
# {'name': 'six'}
# {'name': 'seven'}
# {'name': 'eight'}

R:

library(solvebio)

vault <- Vault.get_personal_vault()
dataset_full_path <- paste(vault$full_path, "/r_examples/data_delete", sep=":")
dataset <- Dataset.get_or_create_by_full_path(dataset_full_path)

# Initial records to import
records <- list(
        list(name='six'),
        list(name='seven'),
        list(name='eight'),
        list(name='nine')
)
imp <- DatasetImport.create(dataset_id = dataset$id,
                            data_records = records)

Dataset.query(id = dataset$id, fields=list('name'))
#   name
#   1 six
#   2 seven
#   3 eight
#   4 nine

# Get the record ID for 'nine'
record <- Dataset.query(id = dataset$id, filters='[["name", "nine"]]', fields=list("_id"))

# Delete mode
imp <- DatasetImport.create(dataset_id = dataset$id,
                            data_records = list(as.list(record)),
                            commit_mode = 'delete')


Dataset.query(id = dataset$id, fields=list('name'))
#   name
#   1 six
#   2 seven
#   3 eight