Skip to content

Querying Files

Overview

SolveBio provides an easy-to-use, real-time API for querying files as well as datasets .

File objects can be queried and filtered on one or more fields. The query results are returned in pages. The easiest way to query files is by using Python library. Shortcuts are available for creating complex filters or can also accept raw JSON.

A basic query returns a page of results from the specified file object:

1
2
clinvar = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')
clinvar.query()
1
2
clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')
Object.query(id = clinvar$id)

You can retrive a specified number of records from the file by setting limit query parameter.

1
2
clinvar = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')
q = clinvar.query(limit=50)
1
2
clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')
Object.query(id = clinvar$id, limit=50)

Supported file extensions and compressions

The file querying is only supported for the following file extensions and compressions:

File Extensions Compression
txt GZIP, BZIP2
csv GZIP, BZIP2
tsv GZIP, BZIP2
bed GZIP, BZIP2
json GZIP, BZIP2
parquet GZIP

The only supported encoding is UTF-8.

The output format of the query can be provided by using output_format parameter which can be one of the following:

Output Format Description
json (default) applicable to all file extensions
csv applicable only to csv, txt, tsv, or bed file extensions
tsv applicable only to csv, txt, tsv, or bed file extensions

Example:

1
2
clinvar = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')
clinvar.query(output_format='json')
1
2
clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')
Object.query(id = clinvar$id, output_format = 'json')

Filters

You can filter on any field in a file.

1
2
clinvar = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')
clinvar.query().filter(clinical_significance='Pathogenic')
1
2
clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')
Object.query(id = clinvar$id, filters = '[["clinical_significance", "Pathogenic"]]')

File headers

Text files such as CSV, TXT, TSV, or BED must be uploaded with headers, otherwise, the query will return wrong results because the query logic considers the first row from the file as the header.

All fields from the file can be retrieved by calling the fields method:

1
fields = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz').query().fields()
1
2
clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')
fields <- Object.fields(id = clinvar$id)

Filters can be modified using "filter actions", which let you adjust the condition of a filter. To specify a filter action, append it to the field name when building a filter: <field>__<action> (for example, date__gte for filtering by dates greater or equal to the input).

Action Description
exact (default) Field is an exact match to value. Useful for case sensitive string field queries.
iexact Field is equal to value (case-insensitive). If the field's value is a list, this will match values in the list, not values within the list.
in Field is "one of" a list of values. (similar to Python's in operator).
range Field is a number within two values. Ranges are inclusive (fully-closed).
gt Field is a number greater than value.
lt Field is a number less than value.
gte Field is a number greater than or equal to value.
lte Field is a number less than or equal to value.
contains Field contains this string value.

Some filter actions (range, gt, lt, gte, lte) may only be used on numeric and date fields.

Full-text fields use the contains filter action by default and act like a typical search would. When using the contains action on string fields, the system converts the filter in an SQL expression: %{VALUE}%.

String Filters

You may filter string fields using exact (or case-insensitive) match or prefix match (prefix). You can also match against multiple strings at once (a boolean or) using the in filter. By default, filters on string fields use the "equals" match.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
q = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/ClinVar-GRCh37.json.gz').query()

# Equals match
q.filter(gene_symbol='BRCA1')

# Equals match (in list)
q.filter(gene_symbol__in=['BRCA1', 'BRCA2'])

# Prefix match
q.filter(gene_symbol__prefix='BRCA')
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')

# Equals match
Object.query(id = clinvar$id, filters='[["gene_symbol", "BRCA1"]]')

# Exact match (in list)
Object.query(id = clinvar$id, filters='[["gene_symbol__in", ["BRCA1", "BRCA2"]]]')

# Prefix match
Object.query(id = clinvar$id, filters='[["gene_symbol__prefix", "BRCA"]]')

Numeric & Date Filters

Numeric and date fields can be filtered by exact match, exact match in list (in), half-open range match (range), and standard operators (gt, lt, gte, lte). Dates are in the format YYYY-MM-DD.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
q = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz').query()

# Equals match
q.filter(rcv_accession_version=4)

# Equals match (in list)
q.filter(rcv_accession_version__in=[1, 2, 3])

# Range query
q.filter(rcv_accession_version__range=[1, 3])

# Operator query (gt/gte/lt/lte)
q.filter(rcv_accession_version__gt=4)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')

# Equals match
Object.query(id = clinvar$id, filters='[["rcv_accession_version", 4]]')

# Equals match (in list)
Object.query(id = clinvar$id, filters='[["rcv_accession_version__in", [1, 2, 3]]]')

# Range query
Object.query(id = clinvar$id, filters='[["rcv_accession_version__range", [1, 3]]]')

# Operator query (gt/gte/lt/lte)
Object.query(id = clinvar$id, filters='[["rcv_accession_version__gt", 4]]')

Combining Filters

The examples below show you how to filter a file on one or two fields. In many cases, you will probably need to combine many filters into a single query.

When manually writing queries in JSON, you can combine and nest filters using a boolean operator ('and', 'or', 'not'). In the Python client, you can combine filters using the Filter ("&" for "and", "|" for "or", and "~" for "not").

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
q = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz').query()

# AND
f = Filter(gene_symbol='BRCA1') & Filter(clinical_significance='Pathogenic')
q.filter(f)

# OR
f = Filter(gene_symbol='BRCA1') | Filter(gene_symbol='BRCA2')
q.filter(f)

# NOT
f = ~ Filter(gene_symbol='BRCA1')
q.filter(f)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')

# AND
Object.query(id = clinvar$id, filters='[{"and": [["gene_symbol", "BRCA1"], ["clinical_significance", "Pathogenic"]]}]')

# OR
Object.query(id = clinvar$id, filters='[{"or": [["gene_symbol", "BRCA1"], ["gene_symbol", "BRCA2"]]}]')

# NOT
Object.query(id = clinvar$id, filters='[{"not": [["gene_symbol", "BRCA1"], ["gene_symbol", "BRCA2"]]}]')

Advanced Filters

You can also compose filters in JSON and apply these filters via Python:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
# A single filter may contains a single tuple
["{FIELD}", "{VALUE}"]

# Or it may be composed of several filters under a boolean operator (and/or)
{
    "{OPERATOR}": [
        ["{FIELD}", "{VALUE}"],
        ["{FIELD}__{ACTION}", "{VALUE}"]
    ]
}

# Providing a list of filters assumes the "and" boolean operator:
[
    ["{FIELD}", "{VALUE}"],
    ["{FIELD}__{ACTION}", "{VALUE}"]
]

# You can also exclude one or more filters using the "not" operator:
{
    "not": ["{FIELD}", "{VALUE1}"]
}

{
    "not": {
        "and": [
            ["{FIELD}", "{VALUE1}"],
            ["{FIELD}", "{VALUE2}"]
        ]
    }
}

# Advanced filters may also be nested in complex ways:
{
    "and": [
        ["{FIELD}", "{VALUE}"],
        {
            "or": [
                {"not": ["{FIELD}", "{VALUE}"]},
                ["{FIELD}", "{VALUE}"]
            ]
        }
    ]
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
clinvar = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz')

# Include all review statuses with "single submitter"
filters = [
    ["review_status__contains", "single submitter"]
]
clinvar.query(filters=filters)

# Exclude all review statuses with "single submitter"
filters = [{
    "not": ["review_status__contains", "single submitter"]
}]
clinvar.query(filters=filters)

# Find review statuses without "single submitter" for allele T or C
filters = [{
    "and": [
        {"not": ["review_status__contains", "single submitter"]},
        {
            "or": [
                ["allele", "T"],
                ["allele", "C"]
            ]
        }
    ]
}]
clinvar.query(filters=filters)

Advanced filter syntax is composed of the following elements:

  • {OPERATOR} is one of and, or, or not.
  • {FIELD} is a documented field name in the file.
  • {ACTION} is a valid field action in the format {FIELD}__{ACTION} (see below).
  • {VALUE} can be a string, numeric, or list value.

By default, a {FIELD} with no attached {ACTION} implies the "case-sensitive equals" (exact) operator.

String and text actions include:

Action Description
iexact (default for string/text) Field is equal to value (case-insensitive). If the field's value is a list, matches the list as a whole, not values within the list.
exact Field is an exact match to value. Useful for longer string and text fields.
in Field is "one of" a list of values.
contains Field contains this string value.

Numeric and date field actions include:

Action Description
exact (default for numeric/date) Field is an exact match to value.
range Field is a number within two within two values (inclusive/fully-closed).
gt Field is a number greater than value.
lt Field is a number less than value.
gte Field is a number greater than or equal to value.
lte Field is a number less than or equal to value.