Querying Files¶
Overview¶
SolveBio provides an easy-to-use, real-time API for querying files as well as datasets .
File objects can be queried and filtered on one or more fields. The query results are returned in pages. The easiest way to query files is by using Python library. Shortcuts are available for creating complex filters or can also accept raw JSON.
A basic query returns a page of results from the specified file object:
1 2 | clinvar = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') clinvar.query() |
1 2 | clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') Object.query(id = clinvar$id) |
You can retrive a specified number of records from the file by setting limit
query parameter.
1 2 | clinvar = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') q = clinvar.query(limit=50) |
1 2 | clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') Object.query(id = clinvar$id, limit=50) |
Supported file extensions and compressions¶
The file querying is only supported for the following file extensions and compressions:
File Extensions | Compression |
---|---|
txt | GZIP, BZIP2 |
csv | GZIP, BZIP2 |
tsv | GZIP, BZIP2 |
bed | GZIP, BZIP2 |
json | GZIP, BZIP2 |
parquet | GZIP |
The only supported encoding is UTF-8.
The output format of the query can be provided by using output_format
parameter which can be one of the following:
Output Format | Description |
---|---|
json (default) | applicable to all file extensions |
csv | applicable only to csv, txt, tsv, or bed file extensions |
tsv | applicable only to csv, txt, tsv, or bed file extensions |
Example:
1 2 | clinvar = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') clinvar.query(output_format='json') |
1 2 | clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') Object.query(id = clinvar$id, output_format = 'json') |
Filters¶
You can filter on any field in a file.
1 2 | clinvar = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') clinvar.query().filter(clinical_significance='Pathogenic') |
1 2 | clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') Object.query(id = clinvar$id, filters = '[["clinical_significance", "Pathogenic"]]') |
File headers
Text files such as CSV, TXT, TSV, or BED must be uploaded with headers, otherwise, the query will return wrong results because the query logic considers the first row from the file as the header.
All fields from the file can be retrieved by calling the fields
method:
1 | fields = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz').query().fields() |
1 2 | clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') fields <- Object.fields(id = clinvar$id) |
Filters can be modified using "filter actions", which let you adjust the condition of a filter. To specify a filter action, append it to the field name when building a filter: <field>__<action>
(for example, date__gte
for filtering by dates greater or equal to the input).
Action | Description |
---|---|
exact (default) | Field is an exact match to value. Useful for case sensitive string field queries. |
iexact | Field is equal to value (case-insensitive). If the field's value is a list, this will match values in the list, not values within the list. |
in | Field is "one of" a list of values. (similar to Python's in operator). |
range | Field is a number within two values. Ranges are inclusive (fully-closed). |
gt | Field is a number greater than value. |
lt | Field is a number less than value. |
gte | Field is a number greater than or equal to value. |
lte | Field is a number less than or equal to value. |
contains | Field contains this string value. |
Some filter actions (range
, gt
, lt
, gte
, lte
) may only be used on numeric and date fields.
Full-text fields use the contains
filter action by default and act like a typical search would. When using the contains
action on string fields, the system converts the filter in an SQL expression: %{VALUE}%
.
String Filters¶
You may filter string fields using exact (or case-insensitive) match or prefix match (prefix
). You can also match against multiple strings at once (a boolean or) using the in
filter. By default, filters on string fields use the "equals" match.
1 2 3 4 5 6 7 8 9 10 | q = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/ClinVar-GRCh37.json.gz').query() # Equals match q.filter(gene_symbol='BRCA1') # Equals match (in list) q.filter(gene_symbol__in=['BRCA1', 'BRCA2']) # Prefix match q.filter(gene_symbol__prefix='BRCA') |
1 2 3 4 5 6 7 8 9 10 | clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') # Equals match Object.query(id = clinvar$id, filters='[["gene_symbol", "BRCA1"]]') # Exact match (in list) Object.query(id = clinvar$id, filters='[["gene_symbol__in", ["BRCA1", "BRCA2"]]]') # Prefix match Object.query(id = clinvar$id, filters='[["gene_symbol__prefix", "BRCA"]]') |
Numeric & Date Filters¶
Numeric and date fields can be filtered by exact match, exact match in list (in
), half-open range match (range
), and standard operators (gt
, lt
, gte
, lte
). Dates are in the format YYYY-MM-DD
.
1 2 3 4 5 6 7 8 9 10 11 12 13 | q = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz').query() # Equals match q.filter(rcv_accession_version=4) # Equals match (in list) q.filter(rcv_accession_version__in=[1, 2, 3]) # Range query q.filter(rcv_accession_version__range=[1, 3]) # Operator query (gt/gte/lt/lte) q.filter(rcv_accession_version__gt=4) |
1 2 3 4 5 6 7 8 9 10 11 12 13 | clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') # Equals match Object.query(id = clinvar$id, filters='[["rcv_accession_version", 4]]') # Equals match (in list) Object.query(id = clinvar$id, filters='[["rcv_accession_version__in", [1, 2, 3]]]') # Range query Object.query(id = clinvar$id, filters='[["rcv_accession_version__range", [1, 3]]]') # Operator query (gt/gte/lt/lte) Object.query(id = clinvar$id, filters='[["rcv_accession_version__gt", 4]]') |
Combining Filters¶
The examples below show you how to filter a file on one or two fields. In many cases, you will probably need to combine many filters into a single query.
When manually writing queries in JSON, you can combine and nest filters using a boolean operator ('and', 'or', 'not'). In the Python client, you can combine filters using the Filter
("&
" for "and", "|
" for "or", and "~
" for "not").
1 2 3 4 5 6 7 8 9 10 11 12 13 | q = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz').query() # AND f = Filter(gene_symbol='BRCA1') & Filter(clinical_significance='Pathogenic') q.filter(f) # OR f = Filter(gene_symbol='BRCA1') | Filter(gene_symbol='BRCA2') q.filter(f) # NOT f = ~ Filter(gene_symbol='BRCA1') q.filter(f) |
1 2 3 4 5 6 7 8 9 10 | clinvar <- Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') # AND Object.query(id = clinvar$id, filters='[{"and": [["gene_symbol", "BRCA1"], ["clinical_significance", "Pathogenic"]]}]') # OR Object.query(id = clinvar$id, filters='[{"or": [["gene_symbol", "BRCA1"], ["gene_symbol", "BRCA2"]]}]') # NOT Object.query(id = clinvar$id, filters='[{"not": [["gene_symbol", "BRCA1"], ["gene_symbol", "BRCA2"]]}]') |
Advanced Filters¶
You can also compose filters in JSON and apply these filters via Python:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | # A single filter may contains a single tuple ["{FIELD}", "{VALUE}"] # Or it may be composed of several filters under a boolean operator (and/or) { "{OPERATOR}": [ ["{FIELD}", "{VALUE}"], ["{FIELD}__{ACTION}", "{VALUE}"] ] } # Providing a list of filters assumes the "and" boolean operator: [ ["{FIELD}", "{VALUE}"], ["{FIELD}__{ACTION}", "{VALUE}"] ] # You can also exclude one or more filters using the "not" operator: { "not": ["{FIELD}", "{VALUE1}"] } { "not": { "and": [ ["{FIELD}", "{VALUE1}"], ["{FIELD}", "{VALUE2}"] ] } } # Advanced filters may also be nested in complex ways: { "and": [ ["{FIELD}", "{VALUE}"], { "or": [ {"not": ["{FIELD}", "{VALUE}"]}, ["{FIELD}", "{VALUE}"] ] } ] } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | clinvar = Object.get_by_full_path('solvebio:public:/ClinVar/3.7.4-2017-01-30/Variants-GRCh37.json.gz') # Include all review statuses with "single submitter" filters = [ ["review_status__contains", "single submitter"] ] clinvar.query(filters=filters) # Exclude all review statuses with "single submitter" filters = [{ "not": ["review_status__contains", "single submitter"] }] clinvar.query(filters=filters) # Find review statuses without "single submitter" for allele T or C filters = [{ "and": [ {"not": ["review_status__contains", "single submitter"]}, { "or": [ ["allele", "T"], ["allele", "C"] ] } ] }] clinvar.query(filters=filters) |
Advanced filter syntax is composed of the following elements:
{OPERATOR}
is one ofand
,or
, ornot
.{FIELD}
is a documented field name in the file.{ACTION}
is a valid field action in the format{FIELD}__{ACTION}
(see below).{VALUE}
can be a string, numeric, or list value.
By default, a {FIELD}
with no attached {ACTION}
implies the "case-sensitive equals" (exact
) operator.
String and text actions include:
Action | Description |
---|---|
iexact | (default for string/text) Field is equal to value (case-insensitive). If the field's value is a list, matches the list as a whole, not values within the list. |
exact | Field is an exact match to value. Useful for longer string and text fields. |
in | Field is "one of" a list of values. |
contains | Field contains this string value. |
Numeric and date field actions include:
Action | Description |
---|---|
exact | (default for numeric/date) Field is an exact match to value. |
range | Field is a number within two within two values (inclusive/fully-closed). |
gt | Field is a number greater than value. |
lt | Field is a number less than value. |
gte | Field is a number greater than or equal to value. |
lte | Field is a number less than or equal to value. |