Skip to content

Joining Datasets

Overview

SolveBio provides possibility to join two datasets by keys that are related to each other. The provided datasets join functionality is very similar to well known left join in SQL.

Beta Feature (Python only)

This feature is currently in beta and available in the latest version of the SolveBio Python client (v2.13.0).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
filters = Filter(clinical_significance__exact="pathogenic") & Filter(gene__exact="PTEN")
query_A = Dataset.get_by_full_path('solvebio:public:/ClinVar/5.1.0-20200720/Variants-GRCH37').query(filters=filters, fields=['variant', 'gene'])
query_B = Dataset.get_by_full_path('solvebio:public:/COSMIC/1.1.1-COSMIC71/SomaticMutationsCoding-GRCh37').query(fields=['cosmic_id', 'count'])

# A result of datasets joining is a new Query object
join_query = query_A.join(query_B, key='variant', key_b='variant')

for record in join_query:
    print(record)

# Output:
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624227-89624227-G', 'cosmic_id': None}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624241-89624241-CA', 'cosmic_id': None}
# {'count': 3, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624241-89624243-C', 'cosmic_id': 'COSM4937'}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624243-89624245-A', 'cosmic_id': None}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624245-89624262-G', 'cosmic_id': None}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624261-89624261-C', 'cosmic_id': None}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624262-89624262-CA', 'cosmic_id': None}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624262-89624264-C', 'cosmic_id': None}
# {'count': 1, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624263-89624263-T', 'cosmic_id': 'COSM86063'}
# ...

You can add more datasets to the join query. There is no limit to the number of additional joins but it may progressively slow down the query.

For example, add GnomAD allele frequencies to the list of PTEN variants:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
query_C = Dataset.get_by_full_path('solvebio:public:/gnomAD/2.1.1/Exomes-GRCh37').query(fields=["variant", "af"])
join_query = join_query.join(query_C, key='variant')

for record in join_query:
    print(record)

# Output:
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624227-89624227-G', 'cosmic_id': None}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624241-89624241-CA', 'cosmic_id': None}
# {'count': 3, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624241-89624243-C', 'cosmic_id': 'COSM4937'}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624243-89624245-A', 'cosmic_id': None}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624245-89624262-G', 'cosmic_id': None}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624261-89624261-C', 'cosmic_id': None}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624262-89624262-CA', 'cosmic_id': None}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624262-89624264-C', 'cosmic_id': None}
# {'count': 1, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624263-89624263-T', 'cosmic_id': 'COSM86063'}
# ...

Join queries cannot be filtered

Once a join query is created (i.e. the output of query.join()), you can no longer apply filters on the query. Use Python code to filter the results after receiving them, as shown in the example below.

Filtering the output of join queries must be done client-side (i.e. within your Python code). For example, to find all the COSMIC IDs for the known pathogenic PTEN variants with GnomAD allele frequencies:

1
2
3
4
5
6
for record in join_query:
    if record['af'] is not None:
        print(record)

# Output:
# {'count': None, 'af': 3.97624e-06, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624275-89624277-C', 'cosmic_id': None}

Advanced examples

There is a possibility to join two datasets whose keys are lists. But first, you should apply the explode function before joining.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
import solvebio as sb

ds1 = sb.Dataset.retrieve("dataset_id")
ds2 = sb.Dataset.retrieve("dataset_id")

# before joining the records, split each record by the values of ensembl_id
query_A = ds1.query(fields=["variant", "ensembl_id"], annotator_params={"pre_annotation_expression": 'explode(record, fields=["ensembl_id"])'})
query_B = ds2.query(fields=["gene", "ensembl_id"])

join_query = query_A.join(query_B, key="ensembl_id")

for record in join_query:
    print(record)

# To then put this into a new dataset
target = sb.Dataset.get_or_create_by_full_path('~/join_output')
join_query.migrate(target)

Join parameters

The join method has the following attributes:

Parameter Value Description
query_b object Query object query_B that will be joined with the initial query query_A.
key string a key from query_A containing a value that makes a relationship with query_B.
key_b string (Optional, default=None) a key from query_B containing a value from key from query_A. If it is None a key argument from query_A will be used.
prefix string (Optional, default=b_) a prefix that will be added to all filtered fields from query_B. If it is set to None, a random prefix will be used.
always_prefix boolean (Optional, default=False) an option to add a prefix either always or only when it is necessary e.g. when both joining keys have the same name.