Joining Datasets¶

Overview¶

SolveBio provides possibility to join two datasets by keys that are related to each other. The provided datasets join functionality is very similar to well known left join in SQL.

Beta Feature (Python only)

This feature is currently in beta and available in the latest version of the SolveBio Python client (v2.13.0).

filters = Filter(clinical_significance__exact="pathogenic") & Filter(gene__exact="PTEN")
query_A = Dataset.get_by_full_path('solvebio:public:/ClinVar/5.1.0-20200720/Variants-GRCH37').query(filters=filters, fields=['variant', 'gene'])
query_B = Dataset.get_by_full_path('solvebio:public:/COSMIC/1.1.1-COSMIC71/SomaticMutationsCoding-GRCh37').query(fields=['cosmic_id', 'count'])

# A result of datasets joining is a new Query object
join_query = query_A.join(query_B, key='variant', key_b='variant')

for record in join_query:
    print(record)

# Output:
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624227-89624227-G', 'cosmic_id': None}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624241-89624241-CA', 'cosmic_id': None}
# {'count': 3, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624241-89624243-C', 'cosmic_id': 'COSM4937'}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624243-89624245-A', 'cosmic_id': None}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624245-89624262-G', 'cosmic_id': None}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624261-89624261-C', 'cosmic_id': None}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624262-89624262-CA', 'cosmic_id': None}
# {'count': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624262-89624264-C', 'cosmic_id': None}
# {'count': 1, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624263-89624263-T', 'cosmic_id': 'COSM86063'}
# ...

You can add more datasets to the join query. There is no limit to the number of additional joins but it may progressively slow down the query.

For example, add GnomAD allele frequencies to the list of PTEN variants:

query_C = Dataset.get_by_full_path('solvebio:public:/gnomAD/2.1.1/Exomes-GRCh37').query(fields=["variant", "af"])
join_query = join_query.join(query_C, key='variant')

for record in join_query:
    print(record)

# Output:
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624227-89624227-G', 'cosmic_id': None}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624241-89624241-CA', 'cosmic_id': None}
# {'count': 3, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624241-89624243-C', 'cosmic_id': 'COSM4937'}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624243-89624245-A', 'cosmic_id': None}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624245-89624262-G', 'cosmic_id': None}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624261-89624261-C', 'cosmic_id': None}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624262-89624262-CA', 'cosmic_id': None}
# {'count': None, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624262-89624264-C', 'cosmic_id': None}
# {'count': 1, 'af': None, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624263-89624263-T', 'cosmic_id': 'COSM86063'}
# ...

Join queries cannot be filtered

Once a join query is created (i.e. the output of query.join()), you can no longer apply filters on the query. Use Python code to filter the results after receiving them, as shown in the example below.

Filtering the output of join queries must be done client-side (i.e. within your Python code). For example, to find all the COSMIC IDs for the known pathogenic PTEN variants with GnomAD allele frequencies:

for record in join_query:
    if record['af'] is not None:
        print(record)

# Output:
# {'count': None, 'af': 3.97624e-06, 'gene': 'PTEN', 'variant': 'GRCH37-10-89624275-89624277-C', 'cosmic_id': None}

Advanced examples¶

There is a possibility to join two datasets whose keys are lists. But first, you should apply the explode function before joining.

import solvebio as sb

ds1 = sb.Dataset.retrieve("dataset_id")
ds2 = sb.Dataset.retrieve("dataset_id")

# before joining the records, split each record by the values of ensembl_id
query_A = ds1.query(fields=["variant", "ensembl_id"], annotator_params={"pre_annotation_expression": 'explode(record, fields=["ensembl_id"])'})
query_B = ds2.query(fields=["gene", "ensembl_id"])

join_query = query_A.join(query_B, key="ensembl_id")

for record in join_query:
    print(record)

# To then put this into a new dataset
target = sb.Dataset.get_or_create_by_full_path('~/join_output')
join_query.migrate(target)

Join parameters¶

The join method has the following attributes:

Parameter	Value	Description
query_b	object	`Query` object `query_B` that will be joined with the initial query `query_A`.
key	string	a key from `query_A` containing a value that makes a relationship with `query_B`.
key_b	string	(Optional, default=`None`) a key from `query_B` containing a value from `key` from `query_A`. If it is `None` a `key` argument from `query_A` will be used.
prefix	string	(Optional, default=`b_`) a prefix that will be added to all filtered fields from `query_B`. If it is set to `None`, a random prefix will be used.
always_prefix	boolean	(Optional, default=`False`) an option to add a prefix either always or only when it is necessary e.g. when both joining keys have the same name.

Last updated 2022-12-07.

Have questions or comments about this article? Get in touch with SolveBio Support by submitting a ticket or by sending us an email.