Querying Datasets¶

Overview¶

SolveBio is designed for easy access to molecular information. It provides an easy-to-use, real time API for querying any dataset on the platform.

Dataset query results are returned in pages, similar to a search engine. To narrow down search results, datasets can be filtered on one or more fields. The easiest way to query datasets is by using one of the client libraries. You can either build queries using a programming language (or even writing raw JSON), or you can build them right on any dataset page in the SolveBio web application.

A basic query returns a page of results from the specified dataset:

Python R Bash

Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query()

# You can order query results using the ordering argument

# Order the query results by clinical_significance ascending
Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query(ordering='clinical_significance')

# Order the query results by clinical_significance descending
Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query(ordering='-clinical_significance')

# Query results can be ordered by multiple columns
# Order the query results by clinical_significance descending and gene_symbol ascending
Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query(ordering=['-clinical_significance', 'gene_symbol'])

clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37")
records <- Dataset.query(id = clinvar$id, limit = 1000, paginate = TRUE)

# You can order query results using the ordering argument

# Order the query results by clinical_significance ascending
records <- Dataset.query(id = clinvar$id, limit = 1000, paginate = TRUE, ordering='clinical_significance')

# Order the query results by clinical_significance descending
records <- Dataset.query(id = clinvar$id, limit=1000, ordering='-clinical_significance')

# Query results can be ordered by multiple columns
# Order the query results by clinical_significance descending and gene_symbol ascending
records <- Dataset.query(id = clinvar$id, limit=1000, ordering=list('-clinical_significance', 'gene_symbol'))

1	curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data

Filters¶

You can filter on any field in a dataset.

Python R Bash

1 2	dataset = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37') dataset.query().filter(clinical_significance='Pathogenic')

1
2
3

clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37")
filters <- '[["clinical_significance", "pathogenic"]]'
records <- Dataset.query(id = clinvar$id, filters=filters)

curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                ["clinical_significance", "Pathogenic"]
            ]
        }'

Filters can be modified using "filter actions", which let you adjust the condition of a filter. To specify a filter action, append it to the field name when building a filter: <field>__<action> (for example, date__gte for filtering by dates greater or equal to the input).

Action	Description
iexact (default)	Field is equal to value (case-insensitive). If the field's value is a list, this will match values within the list, not the list as a whole.
exact	Field is an exact match to value. Useful for case sensitive string field queries.
in	Field is "one of" a list of values. (similar to Python's `in` operator).
range	Field is a number within two values. Ranges are inclusive (fully-closed).
gt	Field is a number greater than value.
lt	Field is a number less than value.
gte	Field is a number greater than or equal to value.
lte	Field is a number less than or equal to value.
contains	Field contains this string value.
regexp	Field value matches this regular expression. (Note: The action is only compatible with datasets)

Some filter actions (range, gt, lt, gte, lte) may only be used on numeric and date fields.

Full-text fields use the contains filter action by default and act like a typical search would. Results are ordered by relevance based on your search terms. When using the contains action on string fields, the system converts the filter in a regular expression: .*{VALUE}.*.

String Filters¶

You may filter string fields using exact (or case-insensitive) match, regular expression match (regexp), or prefix match (prefix). You can also match against multiple strings at once (a boolean or) using the in filter. By default, filters on string fields use the "equals" match.

Python R Bash

q = solvebio.Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/ClinVar-GRCh37').query()

# Equals match
q.filter(gene_symbol='BRCA1')

# Equals match (in list)
q.filter(gene_symbol__in=['BRCA1', 'BRCA2'])

# Regular expression match
q.filter(gene_symbol__regexp='BRCA[12]')

# Prefix match
q.filter(gene_symbol__prefix='BRCA')

# Equals match
clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Combined-GRCh37")

Dataset.query(id = clinvar$id, filters='[["gene_symbol", "BRCA1"]]')

# Equals match (in list)
Dataset.query(id = clinvar$id, filters='[["gene_symbol__in", ["BRCA1", "BRCA2"]]]')

# Regular expression match
Dataset.query(id = clinvar$id, filters='[["gene_symbol__regexp", "BRCA[12]"]]')

# Prefix match
Dataset.query(id = clinvar$id, filters='[["gene_symbol__prefix", "BRCA"]]')

# Equals match
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                ["gene_symbol", "BRCA1"]
            ]
        }'

# Exact match (in list)
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                ["gene_symbol__in", ["BRCA1", "BRCA2"]]
            ]
        }'

# Regular expression match
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                ["gene_symbol__regexp", "BRCA[12]"]
            ]
        }'

# Prefix match
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                ["gene_symbol__prefix", "BRCA"]
            ]
        }'

Text Filters¶

Long (paragraph-length) fields typically use the text data type. The contains filter in text fields works more like a search than a filter. Results that match the search term are brought back in the order of relevance.

Python R Bash

q = Dataset.get_by_full_path('solvebio:public:/MEDLINE/2.3.1-2017/MEDLINE').query()

# Contains match for text fields
q.filter(abstract__contains='diabetes')

1
2
3

# Contains match for text fields
MedLine <- Dataset.get_by_full_path("solvebio:public:/MEDLINE/2.3.1-2017/MEDLINE")
Dataset.query(id = MedLine$id, filters='[["abstract__contains", "diabetes"]]')

# Contains match for text fields
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510113043291943359/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                ["abstract__contains", "diabetes"]
            ]
        }'

Numeric & Date Filters¶

Numeric and date fields can be filtered by exact match, exact match in list (in), half-open range match (range), and standard operators (gt, lt, gte, lte). Dates are in the format YYYY-MM-DD.

Python R Bash

q = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query()

# Equals match
q.filter(rcv_accession_version=4)

# Equals match (in list)
q.filter(rcv_accession_version__in=[1, 2, 3])

# Range query
q.filter(rcv_accession_version__range=[1, 3])

# Operator query (gt/gte/lt/lte)
q.filter(rcv_accession_version__gt=4)

# Equals match
clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37")

Dataset.query(id = clinvar$id, filters='[["rcv_accession_version", 4]]')

# Equals match (in list)
Dataset.query(id = clinvar$id, filters='[["rcv_accession_version__in", [1, 2, 3]]]')

# Range query
Dataset.query(id = clinvar$id, filters='[["rcv_accession_version__range", [1, 3]]]')

# Operator query (gt/gte/lt/lte)
Dataset.query(id = clinvar$id, filters='[["rcv_accession_version__gt", 4]]')

# Equals match
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                ["rcv_accession_version", 4]
            ]
        }'

# Equals match (in list)
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                ["rcv_accession_version__in", [1, 2, 3]]
            ]
        }'

# Range query
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                ["rcv_accession_version__range", [1, 3]]
            ]
        }'

# Operator query (gt/gte/lt/lte)
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                ["rcv_accession_version__gt", 4]
            ]
        }'

Entity Filters¶

Compatibility

The entity filters are only compatible with datasets.

SolveBio-supported Entities can be used for filtering, without requiring the exact field name that the Entity resides in.

Python R

clinvar = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Combined-GRCh37')

# Gene entity query
clinvar.query(entities=[['gene', 'BRCA2']])

# Variant entity query
clinvar.query(entities=[['variant', 'GRCH37-13-32890599-32890599-C']])

# Literature entity query
clinvar.query(entities=[['literature', '26096185']])

clinvar <- Dataset.get_by_full_path("solvebio:public:/ClinVar/3.7.4-2017-01-30/Combined-GRCh37")

# Gene entity query
Dataset.query(id = clinvar$id, entities=list(c("gene", "BRCA2")))

# Variant entity query
Dataset.query(id = clinvar$id, entities=list(c("variant", "GRCH37-13-32890599-32890599-C")))

# Literature entity query
Dataset.query(id = clinvar$id, entities=list(c("literature", "26096185")))

Genomic Coordinate Filters¶

Compatibility

The genomic coordinate filters are only compatible with datasets.

A dataset's genomic build is indicated by the suffix of the dataset's full_path.

Python Bash

# GRCh37
q = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query()

# GRCh38
q = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh38').query()

# Position (all overlapping features) - these two queries are equivalent
q.position('chr11', 18313400)
q.position(chromosome='11', position=18313400)

# Exact position only
q.position(chromosome='11', position=17552955, exact=True)

# Range (all overlapping features) - these two queries are equivalent
q.range('chr11', 18313300, 18315000)
q.range(chromosome='11', start=18313300, stop=18315000)

# Exact range only
q.range(chromosome='11', start=18313399, stop=18313403, exact=True)

# Position (all overlapping features)
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "genome_build": "GRCh37",
            "filters": [
                {
                    "and": [
                        ["genomic_coordinates.start__lte", 18313400],
                        ["genomic_coordinates.stop__gte", 18313400],
                        ["genomic_coordinates.chromosome", "11"]
                    ]
                }
            ]
        }'

# Exact position only
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "genome_build": "GRCh37",
            "filters": [
                {
                    "and": [
                        ["genomic_coordinates.stop", 17552955],
                        ["genomic_coordinates.start", 17552955],
                        ["genomic_coordinates.chromosome", "11"]
                    ]
                }
            ]
        }'

# Range (all overlapping features)
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "genome_build": "GRCh37",
            "filters": [
                {
                    "and": [
                        {
                            "or": [
                                {
                                    "and": [
                                        ["genomic_coordinates.start__lte", 18313300],
                                        ["genomic_coordinates.stop__gte", 18315000]
                                    ]
                                },
                                ["genomic_coordinates.start__range", [18313300, 18315000]],
                                ["genomic_coordinates.stop__range", [18313300, 18315000]]
                            ]
                        },
                        ["genomic_coordinates.chromosome", "11"]
                    ]
                }
            ]
        }'

# Exact range only
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
        "genome_build": "GRCh37",
        "filters": [
            {
                "and": [
                    ["genomic_coordinates.stop", 18313403],
                    ["genomic_coordinates.start", 18313399],
                    ["genomic_coordinates.chromosome", "11"]
                ]
            }
        ]
        }'

Combining Filters¶

The examples below show you how to filter a dataset on one or two fields. In many cases, you will probably need to combine many filters into a single query.

When manually writing queries in JSON, you can combine and nest filters using a boolean operator ('and', 'or', 'not'). In the Python client, you can combine filters using the Filter and GenomicFilter classes ("&" for "and", "|" for "or", and "~" for "not").

Python Bash

q = Dataset.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37').query()

# AND
f = solvebio.Filter(gene_symbol='BRCA1') & solvebio.Filter(clinical_significance='Pathogenic')
q.filter(f)

# OR
f = solvebio.Filter(gene_symbol='BRCA1') | solvebio.Filter(gene_symbol='BRCA2')
q.filter(f)

# NOT
f = ~ solvebio.Filter(gene_symbol='BRCA1')
q.filter(f)

# AND
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                {
                    "and": [
                        ["gene_symbol", "BRCA1"],
                        ["clinical_significance", "Pathogenic"]
                    ]
                }
            ]
        }'

# OR
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
    -H "Content-Type: application/json" \
    -d '{
            "filters": [
                {
                    "or": [
                        ["gene_symbol", "BRCA1"],
                        ["gene_symbol", "BRCA2"]
                    ]
                }
            ]
        }'

# NOT
curl https://$SOLVEBIO_API_KEY@api.solvebio.com/v2/datasets/510110510237166190/data \
  -H "Content-Type: application/json" \
  -d '{
            "filters": [
                {
                    "not": ["gene_symbol", "BRCA1"]
                }
            ]
        }'

Query Strings¶

Compatibility

The query strings are only compatible with datasets.

Query strings are parsed into a series of terms and operators. A query string can be provided as part of a SolveBio dataset query in combination with filters, or as an alternative to filters. Terms in a query string can be single words - "quick" or "brown" - or a phrase surrounded by double quotes - "quick brown" - which will search for all the words in the phrase, in the same order. The query syntax is based on the Lucene query syntax.

Queries and Filters

Queries are great to find records that best match a word or phrase, relative to others. Filters are designed to reduce the potential result-set by asking yes/no questions on every record in a dataset.

Query string operators allow you to customize a search. The available options are explained below:

Field Names

By default, when no field names are specified, all string or text fields are searched for each term. You can provide an explicit field name if you know the field in question:

status:active

You can do an exact match on a specific field, for example:

author:"John Smith"

To search for one-or-more terms in a field, combine them with OR (default) or AND:

title:(quick brown)
title:(quick OR brown)
title:(quick AND brown)

You can also find records with missing fields:

_missing_:title

Or, records where the field has a value (i.e. "not missing"):

_exists_:title

Wildcards

Wildcard searches can be run on individual terms, using ? to replace a single character, and * to replace zero or more characters:

qu?ck bro*

Ranges

Ranges can be specified for almost any field data type. They are most useful for dates and numeric fields. Inclusive ranges are specified with square brackets [min TO max] and exclusive ranges with curly brackets {min TO max}.

For example, this query will retrieve records for all days in 2012:

date:[2012-01-01 TO 2012-12-31]

Similarly, you can also use ranges on numeric fields:

count:[1 TO 5]

And use infinite ranges:

count:[10 TO *]

Standard numeric comparison operators can also be used:

age:>10
age:>=10
age:<10
age:<=10

To combine an upper and lower bound with the simplified syntax, you would need to join two clauses with an AND operator:

age:(>=10 AND <20)
age:(+>=10 +<20)

Always try range filters first

Processing ranges from a query string is much slower and less reliable than using an explicit range filter.

Grouping

Multiple terms or clauses can be grouped together with parentheses to form sub-queries:

(quick OR brown) AND fox

Groups can be used to target a particular field, or to boost the result of a sub-query:

status:(active OR pending) title:(full text search)^2

Reserved characters

The following characters are reserved in query strings and must be escaped with a leading backslash when used as part of a query term:

+ - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /

For example, to search for the string (1+1)=2, your query should be $1\+1$\=2.

Advanced Filters¶

You can also compose filters in JSON and apply these filters via R, Python, or the SolveBio UI:

Json Python R

# A single filter may contains a single tuple
["{FIELD}", "{VALUE}"]

# Or it may be composed of several filters under a boolean operator (and/or)
{
    "{OPERATOR}": [
        ["{FIELD}", "{VALUE}"],
        ["{FIELD}__{ACTION}", "{VALUE}"]
    ]
}

# Providing a list of filters assumes the "and" boolean operator:
[
    ["{FIELD}", "{VALUE}"],
    ["{FIELD}__{ACTION}", "{VALUE}"]
]

# You can also exclude one or more filters using the "not" operator:
{
    "not": ["{FIELD}", "{VALUE1}"]
}

{
    "not": {
        "and": [
            ["{FIELD}", "{VALUE1}"],
            ["{FIELD}", "{VALUE2}"]
        ]
    }
}

# Advanced filters may also be nested in complex ways:
{
    "and": [
        ["{FIELD}", "{VALUE}"],
        {
            "or": [
                {"not": ["{FIELD}", "{VALUE}"]},
                ["{FIELD}", "{VALUE}"]
            ]
        }
    ]
}

dataset = Dataset.get_by_full_path('solvebio:public:/MEDLINE/2.3.1-2017/MEDLINE')

# Include all abstracts with "diabetes"
filters = [
    ["abstract__contains", "diabetes"]
]
dataset.query(filters=filters)

# Exclude all abstracts with "diabetes"
filters = [{
    "not": ["abstract__contains", "diabetes"]
}]
dataset.query(filters=filters)

# Find abstracts without "diabetes" from 1977
filters = [{
    "and": [
        {"not": ["abstract__contains", "diabetes"]},
        {
            "or": [
                ["date_published__regex", "*1977*"],
                ["date_created__range", ["1977-01-01", "1977-12-31"]]
            ]
        }
    ]
}]
dataset.query(filters=filters)

# Contains match for text fields
dataset <- Dataset.get_by_full_path("solvebio:public:/MEDLINE/2.3.1-2017/MEDLINE")

# Include all abstracts with "diabetes"
filters <- '[
    ["abstract__contains", "diabetes"]
]'
Dataset.query(id = dataset$id, filters = filters)

# Exclude all abstracts with "diabetes"
filters = '[{
    "not": ["abstract__contains", "diabetes"]
}]'
Dataset.query(id = dataset$id, filters = filters)

# Find abstracts without "diabetes" from 1977
filters = '[{
    "and": [
        {"not": ["abstract__contains", "diabetes"]},
        {
            "or": [
                ["date_published__regex", "*1977*"],
                ["date_created__range", ["1977-01-01", "1977-12-31"]]
            ]
        }
    ]
}]'
Dataset.query(id = dataset$id, filters = filters)

Advanced filter syntax is composed of the following elements:

{OPERATOR} is one of and, or, or not.
{FIELD} is a documented field name in the dataset.
{ACTION} is a valid field action in the format {FIELD}__{ACTION} (see below).
{VALUE} can be a string, numeric, or list value.

By default, a {FIELD} with no attached {ACTION} implies the "case-insensitive equals" (iexact) operator.

String and text actions include:

Action	Description
iexact	(default for string/text) Field is equal to value (case-insensitive). If the field's value is a list, matches a value within the list, not the list as a whole.
exact	Field is an exact match to value. Useful for longer string and text fields.
in	Field is "one of" a list of values.
contains	Field contains this string value.
regexp	Field value matches this regular expression. (Note: The action is only compatible with datasets)

Numeric and date field actions include:

Action	Description
exact	(default for numeric/date) Field is an exact match to value.
range	Field is a number within two within two values (inclusive/fully-closed).
gt	Field is a number greater than value.
lt	Field is a number less than value.
gte	Field is a number greater than or equal to value.
lte	Field is a number less than or equal to value.

Full-text (text data type) fields automatically use the contains filter action instead.

Last updated 2022-12-07.

Have questions or comments about this article? Get in touch with SolveBio Support by submitting a ticket or by sending us an email.